Abstract


Sample-size requirements were considered for automated essay scoring in cases in which the 
automated essay score estimates the score provided by a human rater. Analysis considered 
both cases in which an essay prompt is examined in isolation and those in which a family 
of essay prompts is studied. In typical cases in which content analysis is not employed and 
in which the only object is to score individual essays to provide feedback to the examinee, 
it appears that several hundred essays are sufficient. For application of one model to a 
family of essays, fewer than 100 essays per prompt may often be adequate. The cumulative 
logit model was explored as a possible replacement of the linear regression model usually 
employed in automated essay scoring; the cumulative logit model performed somewhat 
better than did the linear regression model. 


Key words: Cross-validation, PRESS, residual, regression 



Acknowledgments 

We thank Dan Eignor, David Williamson, Jiahe Qian, and Frank Rijmen for their helpful 
comments; Cathy Trapani and Sailesh Vezzu for their help with the data used in this work; 
and Kim Fryer for her help with proofreading. 





Automated essay-scoring programs such as e-rater® (Attali, Burstein, & Andreyev, 
2003; Burstein, Chodorow, & Leacock, 2004) use samples of essays that have been scored 
by human raters in order to estimate prediction equations in which the dependent variable 
is a human essay score obtained by an examinee on an essay prompt and the independent 
variables are computer-generated essay features for that essay. The prediction equations 
may be applied to predict human scores on essays in which the computer-generated features 
are available but no human scores exist. Prediction may be based on linear regression, as 
is currently the case with e-rater, or may be based on techniques such as cumulative logit 
analysis (Haberman, 2006). 

A basic question to consider with automated essay scoring is the sample size required 
for satisfactory estimation of the prediction equations. Currently, the e-rater V.2 software 
(Attali & Burstein, 2006) requires a sample size of 500 to build the regression model 
and another sample of 500 to cross-validate the regression model. Interestingly, after 
cross-validation, the model is not re-estimated. The processing waits until the pre-assigned 
sample size is reached. The sample-size selection problem may be addressed by use of the 
cross-validation techniques described in Haberman (2006); however, some variations on 
the approach are appropriate when a family of essay prompts is available rather than a 
single essay prompt. In addition, some added variations are reasonable to consider to avoid 
problems with outliers. 

Some of the methodologies discussed in this paper were discussed in Haberman (2006), 
and this paper follows Haberman (2006) in its use of mean-squared error to assess prediction 
quality. This emphasis on mean-squared error reflects common statistical practice with 
linear estimation and reflects the common practice in testing of adding item scores in 
assessments. However, this paper expands the methods of Haberman (2006), provides 
applications of the methods to a wider number of prompts, and discusses the application of 
the methods discussed in Haberman (2006) to a family of essay prompts. 

Section 1 describes the data used in this work. Section 2 considers some basic screens
for outliers in essay features that may cause distortions in analysis. Section 3 describes
use of deleted residuals to assess prediction accuracy and to estimate the loss of precision
expected for a given sample size. In that section, applications are made to four prompt
families for several different approaches based on linear regression. In Section 4, alternative
analysis based on cumulative logit models is considered. Some practical conclusions are
provided in Section 5. Familiarity with standard works on regression analysis (Draper &
Smith, 1998; Weisberg, 1985) and classical test theory (Lord & Novick, 1968) is helpful in
reading this report.


1. Data 

The data used in the analysis are four groups of prompts. Group 1 consists of 74 
prompts associated with a particular licensure test, Group 2 includes 14 prompts from 
a discontinued test of knowledge of English for examinees whose native language is not 
English, Group 3 comprises 26 prompts for a graduate admissions examination, and Group 
4 uses 16 practice prompts for a test of knowledge of English administered to examinees 
whose native language is not English. For each prompt, about 500 essays are available. 

In Group 1, human scores are on a 4-point scale, and only a small number of essays are 
double-scored. Groups 2 and 3 use double-scoring and have a 6-point scale, while Group 4 
has double-scoring and a 5-point scale. In all cases, 1 is the lowest score for a valid essay. 
As a consequence, no ratings out of the range of a valid essay were used in this study. 

In the analysis in this report, predictors used were logarithm of number of discourse 
elements (logdtu), logarithm of average number of words per discourse element (logdta), 
minus the square root of the number of grammatical errors detected per word (nsqg), minus 
the square root of the number of mechanics errors detected per word (nsqm), minus the 
square root of the number of usage errors detected per word (nsqu), minus the square root 
of the number of style errors detected per word (nsqstyle), minus the median Standard 
Frequency Index of words in the essay for which the index can be evaluated (nwfmedian), 
and average word length (wordln2). Features were based on those used in e-rater Version 
7.2 for models without content vector analysis. The signs were selected so that the normal 
sign of the corresponding regression coefficient would be positive. 

Table 1.
Frequency of Essays With Fewer Than 25 Words

Group   Number in group   Number very short in group   Fraction very short in group
1            37,000                  346                        0.0094
2             6,384                    0                        0.0000
3            12,251                  117                        0.0096
4             8,036                   41                        0.0051

2. Outlier Screens 

It is prudent in any analysis of essays to exclude submissions that are too short to 
be meaningful and those with feature outliers that suggest major typing errors. As in 
Haberman (2006), the rule was adopted that any essay considered must have at least 
25 words. This restriction removes many cases in which essay features exhibit unusual 
behavior. An added rule was adopted that no essay in which the average word length 
exceeded 7.5 characters be considered. The issue here is that such a case is likely to involve 
an error by the writer in using the keyboard. For example, it may occur if the space bar is 
not used properly. 
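
The two screens can be sketched in a few lines of Python. This is only an illustration: the function name is hypothetical, and it assumes that words are whitespace-delimited tokens and that word length is the character count of each token, neither of which is specified in the report.

```python
def passes_outlier_screens(essay_text, min_words=25, max_avg_word_length=7.5):
    """Return True if an essay survives both screens described above.

    Assumption: words are whitespace-delimited tokens, and word length is
    the number of characters in a token (the report does not say how
    words are delimited).
    """
    words = essay_text.split()
    if len(words) < min_words:
        return False  # fewer than 25 words: too short to be meaningful
    average_length = sum(len(word) for word in words) / len(words)
    return average_length <= max_avg_word_length


# Made-up essays for illustration only.
short_essay = "Too short to score."
ordinary_essay = " ".join(["word"] * 30)
run_together_essay = " ".join(["nospacesbetweenwords"] * 30)  # space bar misuse

for essay in (short_essay, ordinary_essay, run_together_essay):
    print(passes_outlier_screens(essay))
```

An essay with exactly 25 words is retained under this sketch, which matches the rule that an essay must have at least 25 words.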

The restriction on number of words involves an appreciable number of essays, as is 
evident in Table 1. 

Except in Group 1, all essays with fewer than 25 words received the lowest possible
human score. In Group 1, 8 of the 346 such essays received human scores of 2 rather than
1. In contrast, for essays with at least 25 words in Group 1, only about 13% received scores
of 1.

The restriction on average word length affected very few essays. In the case of Group
1, two such essays arose, and one essay appeared in Group 3. All received scores of 1 and
also had fewer than 25 words.

An alternative check on outliers involved an examination of standardized values of
variables that exceeded 4 in magnitude. The standardization was conducted for all essays
for a prompt after removing those essays with fewer than 25 words. In examining results,
it is helpful to note that a feature with a normal distribution would yield standardized
values of magnitude 4 or greater with probability 0.00006. Observed rates were somewhat
higher for most features, as is evident in Table 2; however, the results were not unusual
for relatively conventional distributions. For example, under a logistic distribution, the
probability of a standardized value of magnitude at least 4 is 0.0014. For an exponential
distribution, this probability is 0.0183. The results of regression analysis described in this
paper provided no compelling reason to remove outliers other than essays with fewer than
25 words or an average word length above 7.5. In general, outliers must be rather extreme
before they have significant impact on the analysis in this report. This impact normally will
be evident through the analysis of variance inflation in Section 3. Outliers are a concern
if the estimated variance inflation is much higher than usually encountered for a given
sample size and number of predictors; however, as already indicated, no case requiring
consideration of outliers was encountered that is not associated with average word length
or number of words in the essay.
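
The normal and logistic reference probabilities quoted above can be checked with the standard library alone. The sketch below is illustrative; the function names are hypothetical, and the logistic distribution is assumed to be rescaled to mean 0 and variance 1 before the threshold of 4 is applied.

```python
import math


def normal_two_sided_tail(z):
    """P(|Z| >= z) for a standard normal variable Z."""
    return math.erfc(z / math.sqrt(2.0))


def logistic_two_sided_tail(z):
    """P(|Z| >= z) for a logistic variable with mean 0 and variance 1."""
    # A logistic distribution with scale s has variance (pi * s)^2 / 3,
    # so unit variance corresponds to s = sqrt(3) / pi.
    scale = math.sqrt(3.0) / math.pi
    return 2.0 / (1.0 + math.exp(z / scale))


print(f"normal:   {normal_two_sided_tail(4.0):.5f}")
print(f"logistic: {logistic_two_sided_tail(4.0):.4f}")
```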

3. Sample-Size Determination for the Linear Regression Model 

In the case of linear regression, deleted residuals provide a basic method for assessing 
the accuracy of predicting results of human scoring when applied to data not used to 
estimate regression parameters (Haberman, 2006). In essence, deleted residuals provide an 
approach to cross-validation that requires only minimal computations and provides much 
higher accuracy than primitive approaches in which half a sample is used to construct a 
model and half a sample is used to examine prediction quality. We will first lay out the 
statistical model for essay scoring and provide expressions for several mean-squared errors, 
proportional reductions, and relative increases in mean-squared error that are crucial in 
sample-size determination. We will then discuss how to estimate the above mentioned 
quantities using deleted residuals. 

Table 2.
Frequency of Outliers

Group   Variable    Number in group   Number of outliers in group   Fraction of outliers in group
1       logdta           36,697                   43                         0.0012
1       logdtu           36,697                    0                         0.0000
1       nwfmedian        36,697                  104                         0.0028
1       nsqg             36,697                   86                         0.0023
1       nsqm             36,697                  196                         0.0053
1       nsqu             36,697                  104                         0.0028
1       nsqstyle         36,697                    3                         0.0001
1       wordln2          36,697                   31                         0.0008
2       logdta            6,384                   14                         0.0024
2       logdtu            6,384                    0                         0.0000
2       nwfmedian         6,384                    9                         0.0014
2       nsqg              6,384                   15                         0.0024
2       nsqm              6,384                   14                         0.0022
2       nsqu              6,384                   16                         0.0025
2       nsqstyle          6,384                    0                         0.0000
2       wordln2           6,384                    2                         0.0003
3       logdta           12,143                   18                         0.0015
3       logdtu           12,143                    5                         0.0004
3       nwfmedian        12,143                    3                         0.0002
3       nsqg             12,143                   26                         0.0021
3       nsqm             12,143                   28                         0.0023
3       nsqu             12,143                   17                         0.0014
3       nsqstyle         12,143                    0                         0.0000
3       wordln2          12,143                    5                         0.0004
4       logdta            7,997                    6                         0.0008
4       logdtu            7,997                    3                         0.0004
4       nwfmedian         7,997                   17                         0.0021
4       nsqg              7,997                    9                         0.0011
4       nsqm              7,997                    9                         0.0011
4       nsqu              7,997                    7                         0.0009
4       nsqstyle          7,997                    0                         0.0000
4       wordln2           7,997                    6                         0.0008

3.1 The Statistical Model

Consider a random sample of essays $i$, $0 \le i \le n$. For some positive integer $q < n - 1$,
for essay $i$, let $Y_{ij}$ be the holistic score provided by reader $j$, $1 \le j \le m_i$, $m_i \ge 1$, and let
$\mathbf{X}_i$ be a $q$-dimensional vector with coordinates $X_{ik}$, $1 \le k \le q$, that are numerical features
of the essay that have been generated by computer processing. For example, $X_{i1}$ might
be the observed logdta for essay $i$, and $q$ might be 8. Let $\bar{Y}_i$ be the average of the $Y_{ij}$,
$1 \le j \le m_i$. Assume that the holistic scores are integers from 1 to $G$ for an integer $G \ge 2$,
and assume that the $X_{ik}$ all have finite fourth moments and that the covariance matrix
$\mathrm{Cov}(\mathbf{X})$ of $\mathbf{X}_i$ is positive-definite. In the simplest cases, $m_i$ is a fixed value $m$ for all essays.
More generally, the $m_i$ are independent random variables that are independent of the $X_{ik}$
and $Y_{ij}$, and each $m_i \le m$ for some given integer $m \ge 1$. In typical applications, details of
the rating process are quite limited, so that it is appropriate to assume that independent
and identically distributed random variables $T_i$, the true essay scores, exist such that

$$Y_{ij} = T_i + e_{ij},$$

$T_i$ has positive finite variance $\sigma_T^2$, and the scoring errors $e_{ij}$ are all uncorrelated, have mean
0, have common positive variance $\sigma^2$, and are uncorrelated with $T_i$ and the $X_{ik}$.

The assumptions on the errors can be violated if the same rater scores many essays
from many examinees and if the conditional distribution of $e_{ij}$ depends on the specific rater
who provides score $j$ for essay $i$. Because virtually all data involve far fewer raters than
examinees, the assumptions on the $e_{ij}$ are not entirely innocuous. Without data in which
raters are identified, it is impossible to investigate the implications of assignment of the
same rater to many essays. It appears that the methods used in this report can still be used
if the probability that two essays receive the same rater is the same for all pairs of essays.

A further possible violation of assumptions arises when essay features $X_{ik}$ are used that
depend on properties of essays other than essay $i$. This issue arises in practice in e-rater
when content vector analysis is considered. The approach in this report does not apply to
features associated with content vector analysis (Attali & Burstein, 2006).

Consider use of ordinary least squares with essays $i$ from 1 to $n$ to estimate the
coefficients $\alpha$ and $\beta_k$, $1 \le k \le q$, that minimize the mean-squared error

$$\sigma_d^2 = E(d_i^2) \tag{1}$$

for

$$d_i = \bar{Y}_i - T_i^*,$$

where

$$T_i^* = \alpha + \sum_{k=1}^{q} \beta_k X_{ik}.$$

Let $\boldsymbol{\beta}$ denote the $q$-dimensional vector with coordinates $\beta_k$, $1 \le k \le q$. The estimate $a$ of $\alpha$
and the estimates $b_k$ of $\beta_k$ minimize the residual sum of squares

$$S_r = \sum_{i=1}^{n} r_i^2,$$

where

$$r_i = \bar{Y}_i - \hat{T}_i$$

and

$$\hat{T}_i = a + \sum_{k=1}^{q} b_k X_{ik}.$$

The estimates $a$ and $b_k$ are uniquely determined if the sample covariance matrix
$\widehat{\mathrm{Cov}}(\mathbf{X})$ of the $\mathbf{X}_i$, $1 \le i \le n$, is positive definite. In case the estimates are not unique, then
they may be chosen both to minimize $S_r$ and to minimize $a^2 + \mathbf{b}'\mathbf{b}$ (Rao & Mitra, 1971,
p. 51). Under the above-mentioned assumptions on the rater errors $e_{ij}$, the mean-squared
error of

$$d_{Ti} = T_i - T_i^*$$

is

$$\sigma_{dT}^2 = E([d_{Ti}]^2),$$

and

$$\sigma_d^2 = E(\bar{Y}_i - T_i + T_i - T_i^*)^2 = E(T_i - T_i^*)^2 + E(\bar{Y}_i - T_i)^2 = \sigma_{dT}^2 + \sigma^2/m_H, \tag{2}$$

where $m_H = 1/E(1/m_i)$ is the harmonic mean of $m_i$. If each $m_i$ is $m$, then $E(1/m_i) = 1/m$
(Haberman, 2006). The mean-squared error $\sigma_d^2$ defined in Equation 1 is the smallest
mean-squared error achievable by linear prediction of $\bar{Y}_i$ by $\mathbf{X}_i$ when the joint covariance
matrix of $\bar{Y}_i$ and $\mathbf{X}_i$ is known. Similarly, $\sigma_{dT}^2$ is the smallest mean-squared error achievable
by linear prediction of the true essay score $T_i$ by $\mathbf{X}_i$. Conditional on the use of $M$ raters, so
that $m_i = M > 0$, the smallest mean-squared error achievable by linear prediction of $\bar{Y}_i$ by
$\mathbf{X}_i$ is

$$\sigma_{dM}^2 = E(d_i^2 \mid m_i = M) = \sigma_{dT}^2 + \sigma^2/M.$$

To judge the effectiveness of linear regression, it is often helpful to compare the
mean-squared error achieved by a trivial prediction of $\bar{Y}_i$ or $T_i$ in which a constant predictor
$a$ is used. The best choice of $a$ for both cases is $E(\bar{Y}_i) = E(T_i)$. The mean-squared error
for prediction of $\bar{Y}_i$ by $E(\bar{Y}_i)$ is the variance $\sigma_{\bar{Y}}^2$ of $\bar{Y}_i$, and the mean-squared error for
prediction of $T_i$ by $E(T_i)$ is the variance $\sigma_T^2$ of $T_i$. Clearly

$$\sigma_{\bar{Y}}^2 = \sigma_T^2 + \sigma^2/m_H. \tag{3}$$

The proportional reduction of mean-squared error achieved by linear prediction of $\bar{Y}_i$ by $T_i^*$
instead of by $E(\bar{Y}_i)$ is then

$$\rho_{\bar{Y}}^2 = \frac{\sigma_{\bar{Y}}^2 - \sigma_d^2}{\sigma_{\bar{Y}}^2}. \tag{4}$$

In the case of linear prediction of $T_i$, the proportional reduction of mean-squared error in
predicting $T_i$ by $T_i^*$ instead of by $E(T_i)$ is

$$\rho_T^2 = \frac{\sigma_T^2 - \sigma_{dT}^2}{\sigma_T^2} = \frac{\sigma_{\bar{Y}}^2 - \sigma^2/m_H - \sigma_d^2 + \sigma^2/m_H}{\sigma_T^2} = \frac{\sigma_{\bar{Y}}^2 - \sigma_d^2}{\sigma_T^2} = \rho_{\bar{Y}}^2 \sigma_{\bar{Y}}^2/\sigma_T^2,$$

using Equations 2, 3, and 4. Note that $\sigma_T^2$ is less than $\sigma_{\bar{Y}}^2$, so that $\rho_T^2$ exceeds $\rho_{\bar{Y}}^2$.

Conditional on $m_i = M > 0$, the mean-squared error for prediction of $\bar{Y}_i$ by $E(\bar{Y}_i)$ is

$$\sigma_{\bar{Y}M}^2 = E([\bar{Y}_i - E(\bar{Y}_i)]^2 \mid m_i = M) = E([T_i - E(T_i)]^2) + E([\bar{Y}_i - T_i]^2 \mid m_i = M) = \sigma_T^2 + \sigma^2/M,$$

so that the proportional reduction in mean-squared error in predicting $\bar{Y}_i$ by $\mathbf{X}_i$ instead of
by $E(\bar{Y}_i)$ is

$$\rho_{\bar{Y}M}^2 = \frac{\sigma_{\bar{Y}M}^2 - \sigma_{dM}^2}{\sigma_{\bar{Y}M}^2} = \rho_{\bar{Y}}^2 \sigma_{\bar{Y}}^2/\sigma_{\bar{Y}M}^2.$$

The relationship of $\rho_{\bar{Y}}^2$ to $\rho_{\bar{Y}M}^2$ depends on whether $M$ exceeds $m_H$.
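
The conversions among the three proportional reductions can be illustrated with a short Python sketch. The function names and all numeric inputs below are hypothetical and chosen only to show the arithmetic; they are not results from the report.

```python
def prmse_true_score(rho2_ybar, var_ybar, sigma2, m_h):
    """Proportional reduction for the true score T_i.

    The true-score variance is the variance of the average score minus
    sigma^2 / m_H (Equation 3); the numerator is unchanged.
    """
    var_true = var_ybar - sigma2 / m_h
    return rho2_ybar * var_ybar / var_true


def prmse_given_raters(rho2_ybar, var_ybar, sigma2, m_h, m_raters):
    """Proportional reduction for an average of exactly m_raters scores."""
    var_true = var_ybar - sigma2 / m_h
    var_ybar_m = var_true + sigma2 / m_raters
    return rho2_ybar * var_ybar / var_ybar_m


# Hypothetical inputs for illustration only.
rho2_ybar = 0.65   # proportional reduction for the average score
var_ybar = 0.95    # variance of the average score
sigma2 = 0.30      # rater error variance
m_h = 2.0          # harmonic mean number of raters

rho2_true = prmse_true_score(rho2_ybar, var_ybar, sigma2, m_h)
rho2_one_rater = prmse_given_raters(rho2_ybar, var_ybar, sigma2, m_h, 1)
rho2_two_raters = prmse_given_raters(rho2_ybar, var_ybar, sigma2, m_h, 2)
print(rho2_true, rho2_one_rater, rho2_two_raters)
```

With these inputs the true-score coefficient is the largest of the three, a single rater gives a smaller coefficient than the average-score coefficient, and M equal to the harmonic mean reproduces the average-score coefficient.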



3.2 Inflation of Mean-Squared Error 

Cross-validation entails evaluation of the conditional mean-squared error

$$\tau_{r0}^2 = E(r_0^2 \mid \mathbf{X}_i, 1 \le i \le n)$$

for prediction of $\bar{Y}_0$ by $\hat{T}_0$ given the predictors $\mathbf{X}_i$, $1 \le i \le n$. The important issue is that
$r_0 = \bar{Y}_0 - \hat{T}_0$ is the prediction error of the average score $\bar{Y}_0$ based on the predicted average
$\hat{T}_0$, where $\hat{T}_0$ employs the predictors from essay 0 but has estimates $a$ and $b_k$ developed from
essays 1 to $n$. Let $f_i = \hat{T}_i - T_i^*$ be the difference between the estimated best linear predictor
$\hat{T}_i$ and the actual best linear predictor $T_i^*$, and let

$$\tau_{f0}^2 = E(f_0^2 \mid \mathbf{X}_i, 1 \le i \le n).$$

It can be shown that

$$\tau_{r0}^2 = \sigma_d^2 + \tau_{f0}^2 \ge \sigma_d^2. \tag{5}$$

As the sample size $n$ becomes large, standard large-sample arguments as in Box (1954) and
Gilula and Haberman (1994) can be used to prove that $n\tau_{f0}^2$ converges with probability 1 to

$$\rho = \sigma_d^2 + \mathrm{tr}([\mathrm{Cov}(\mathbf{X})]^{-1}\mathrm{Cov}(d\mathbf{X})), \tag{6}$$

where tr is the trace operator and $\mathrm{Cov}(d\mathbf{X})$ is the covariance matrix of $d_i\mathbf{X}_i$. In the special
case in which $d_i$ is independent of $\mathbf{X}_i$, $\rho = (q + 1)\sigma_d^2$ (Haberman, 2006), and $n\tau_{f0}^2$ is $\rho$
whenever the sample covariance matrix of the $\mathbf{X}_i$, $1 \le i \le n$, is positive definite. More
generally, if $c_1$ is the minimum possible conditional expectation $E(d_i^2 \mid \mathbf{X}_i)$ of $d_i^2$ given $\mathbf{X}_i$
and $c_2$ is the maximum possible conditional expectation $E(d_i^2 \mid \mathbf{X}_i)$ of $d_i^2$ given $\mathbf{X}_i$, then $\rho$ is
between $(q + 1)c_1$ and $(q + 1)c_2$. Note that for $d_i$ independent of $\mathbf{X}_i$, $c_1 = c_2 = \sigma_d^2$, so that
the general result indeed implies that $\rho = (q + 1)\sigma_d^2$.

Consider the relative increase

$$I = \tau_{r0}^2/\sigma_d^2 - 1$$

in conditional mean-squared error due to parameter estimation (i.e., due to estimation
of $\bar{Y}_0$ by $\hat{T}_0$ instead of by $T_0^*$). This relative increase, which may be termed inflation in
mean-squared error, plays a key role in the methodology considered in this report. The
relative increase $I$ is $(q + 1)/n$ if, for each essay $i$, $d_i$ is independent of $\mathbf{X}_i$ and the $\mathbf{X}_i$ have
a positive-definite sample covariance matrix. More generally, with probability 1, $nI$ has
a limit between $(q + 1)c_1/c_2$ and $(q + 1)c_2/c_1$. Note that $c_1 = c_2 = \sigma_d^2$ if the standard
regression assumptions hold, so that the general formula is consistent with the fact that
$nI = q + 1$ if the sample covariance matrix of the $\mathbf{X}_i$, $1 \le i \le n$, is positive-definite. For
instance, if $q = 8$ as in the e-rater example, then a sample size of 360 would be expected to
yield a relative increase in mean-squared error of 2.5% if standard regression assumptions
hold and the $\mathbf{X}_i$, $1 \le i \le n$, have a positive-definite covariance matrix.
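
Under the standard regression assumptions, the relation between sample size and inflation is simple arithmetic, as the following sketch shows. The function names are hypothetical, and the calculation applies only when the inflation equals (q + 1)/n; it is not a substitute for estimating the inflation from data.

```python
import math


def expected_inflation(q, n):
    """Relative increase in mean-squared error, (q + 1) / n."""
    return (q + 1) / n


def sample_size_for_inflation(q, target):
    """Smallest n for which (q + 1) / n does not exceed the target."""
    # The small tolerance guards against floating-point noise in the division.
    return math.ceil((q + 1) / target - 1e-9)


print(expected_inflation(8, 360))            # the e-rater example with q = 8
print(sample_size_for_inflation(8, 0.025))   # essays needed for 2.5% inflation
print(sample_size_for_inflation(8, 0.01))    # essays needed for 1% inflation
```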

Computations of relative increases in mean-squared error must be modified to some
extent to study the increase in prediction error for the true essay score $T_0$. In this case,
consider the error

$$r_{Ti} = T_i - \hat{T}_i.$$

The conditional mean-squared error

$$\tau_{rT0}^2 = E([r_{T0}]^2 \mid \mathbf{X}_i, 1 \le i \le n)$$

is compared to $E([T_i - T_i^*]^2) = \sigma_{dT}^2$. Because

$$\tau_{rT0}^2 = \sigma_{dT}^2 + \tau_{f0}^2,$$

the relative increase

$$I_T = \tau_{rT0}^2/\sigma_{dT}^2 - 1$$

is equal to $I\sigma_d^2/\sigma_{dT}^2 \ge I$.

Similar arguments can be applied to the relative increase $I_M$ in mean-squared error
for approximation of $\bar{Y}_0$ conditional on $m_0 = M > 0$. Given $m_0 = M$, the conditional
mean-squared error

$$\tau_{rM0}^2 = E([\bar{Y}_0 - \hat{T}_0]^2 \mid \mathbf{X}_i, 1 \le i \le n, m_0 = M) = \tau_{rT0}^2 + \sigma^2/M$$

for predicting $\bar{Y}_0$ by $\hat{T}_0$ is compared to the conditional mean-squared error $\sigma_{dM}^2$ for prediction
of $\bar{Y}_0$ by $T_0^*$. The relative increase in mean-squared error, $I_M$, is defined as

$$I_M = \tau_{rM0}^2/\sigma_{dM}^2 - 1.$$

3.3 Estimation of $\tau_{r0}^2$ Using the Predicted Residual Sum of Squares (PRESS) Statistic

Estimation of mean-squared errors may be accomplished by use of deleted residuals
or by some modification of standard results from the decomposition of sums of squares
associated with regression analysis. The former approach is simpler to employ in terms of
exploitation of commonly available software, although the latter approach is more efficient
computationally. For each essay $i$, let $I(i)$ be the set of integers 1 to $n$ that are not $i$. The
deleted residual $d_{(i)}$ (Neter, Kutner, Nachtsheim, & Wasserman, 1996, pp. 372-373) is the
difference $\bar{Y}_i - \hat{T}_{(i)}$, where

$$\hat{T}_{(i)} = a_{(i)} + \sum_{k=1}^{q} b_{k(i)} X_{ik}$$

and $a_{(i)}$ and $b_{k(i)}$ are found by minimizing the sum of squares

$$\sum_{j \in I(i)} \left[\bar{Y}_j - a_{(i)} - \sum_{k=1}^{q} b_{k(i)} X_{jk}\right]^2$$

in which data from essay $i$ are omitted. Computation of $d_{(i)}$ involves minimal work, for

$$d_{(i)} = r_i/(1 - h_{ii}),$$

where

$$h_{ii} = n^{-1} + (\mathbf{X}_i - \bar{\mathbf{X}})'\mathbf{C}^{-1}(\mathbf{X}_i - \bar{\mathbf{X}})$$

is the $i$th diagonal element of the hat matrix (Draper & Smith, 1998, pp. 205-207), the
vector of sample means of essay variables is

$$\bar{\mathbf{X}} = n^{-1}\sum_{i=1}^{n} \mathbf{X}_i,$$

and the matrix of corrected sums of cross products is

$$\mathbf{C} = \sum_{i=1}^{n} (\mathbf{X}_i - \bar{\mathbf{X}})(\mathbf{X}_i - \bar{\mathbf{X}})'.$$

Deleted residuals are commonly computed by standard software packages such as SAS.
Given deleted residuals, $\tau_{r0}^2$ may be estimated by the PRESS (Neter et al., 1996, pp.
345-346) sum of squares

$$s_{r0}^2 = n^{-1}\sum_{i=1}^{n} d_{(i)}^2.$$
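
The shortcut for deleted residuals (ordinary residual divided by one minus the leverage) can be checked against brute-force refitting. The sketch below is a minimal illustration that assumes a single predictor (q = 1), so that the matrix of corrected sums of cross products reduces to a scalar; the data and function names are made up for the example.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = a + b * x; returns (a, b)."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    c = sum((x - x_bar) ** 2 for x in xs)  # corrected sum of squares (C)
    b = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / c
    return y_bar - b * x_bar, b


def deleted_residuals_from_leverage(xs, ys):
    """Deleted residuals r_i / (1 - h_ii) from a single fit to all essays."""
    n = len(xs)
    x_bar = sum(xs) / n
    c = sum((x - x_bar) ** 2 for x in xs)
    a, b = fit_line(xs, ys)
    deleted = []
    for x, y in zip(xs, ys):
        residual = y - (a + b * x)
        leverage = 1.0 / n + (x - x_bar) ** 2 / c  # h_ii for one predictor
        deleted.append(residual / (1.0 - leverage))
    return deleted


def deleted_residuals_by_refitting(xs, ys):
    """Deleted residuals computed the slow way: refit without essay i."""
    deleted = []
    for i in range(len(xs)):
        a, b = fit_line(xs[:i] + xs[i + 1:], ys[:i] + ys[i + 1:])
        deleted.append(ys[i] - (a + b * xs[i]))
    return deleted


def press_mean_square(deleted):
    """PRESS sum of squares divided by n."""
    return sum(d * d for d in deleted) / len(deleted)


# Made-up feature values and average human scores, for illustration only.
features = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
scores = [1.5, 2.0, 3.5, 3.0, 4.5, 5.0]
fast = deleted_residuals_from_leverage(features, scores)
slow = deleted_residuals_by_refitting(features, scores)
print(press_mean_square(fast), press_mean_square(slow))
```

The two routes agree up to floating-point error, which is why a single regression run suffices to obtain the PRESS statistic.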

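The scaled difference $n[E(s_{r0}^2 \mid \mathbf{X}_i, 1 \le i \le n) - \tau_{r0}^2]$ converges to 0 with probability 1, and
$n^{1/2}(s_{r0}^2 - \tau_{r0}^2)$ converges in distribution to a normal random variable with mean 0 and
variance equal to the variance of $d_i^2$. Alternatively, application of
an expansion of $1/(1 - h_{ii})^2$ shows that $\tau_{r0}^2$ can be estimated by

$$\hat{\tau}_{r0}^2 = s_r^2\left(1 + \frac{2}{n}\right) + \frac{2}{n}\,\mathrm{tr}([\widehat{\mathrm{Cov}}(\mathbf{X})]^{-1}\widehat{\mathrm{Cov}}(d\mathbf{X})), \tag{7}$$

where

$$s_r^2 = n^{-1}\sum_{i=1}^{n} r_i^2$$

is the residual sum of squares divided by $n$, $\widehat{\mathrm{Cov}}(\mathbf{X})$ is the sample covariance matrix of
the $\mathbf{X}_i$, $1 \le i \le n$, and $\widehat{\mathrm{Cov}}(d\mathbf{X})$ is the sample covariance matrix of the $d_i\mathbf{X}_i$, $1 \le i \le n$.
In case $\widehat{\mathrm{Cov}}(\mathbf{X})$ is singular, the Moore-Penrose inverse can be used. The scaled difference
$n(\hat{\tau}_{r0}^2 - s_{r0}^2)$ converges in probability to 0.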

3.4 Estimation of $\tau_{rT0}^2$ and $\tau_{rM0}^2$

Estimation of $\tau_{rT0}^2$ may be accomplished if the probability is positive that some $m_i$
exceeds 1. If $m_i$ is not constant and if the conditional variance of $e_{ij}$ given $T_i$ or $\mathbf{X}_i$ is not
assumed constant, then estimation is more complicated. A consistent estimate of $\sigma^2$ is
provided by

$$\hat{\sigma}^2 = n_J^{-1}\sum_{i \in J} (m_i - 1)^{-1}\sum_{j=1}^{m_i} (Y_{ij} - \bar{Y}_i)^2,$$

where $J$ is the set of integers $i$, $1 \le i \le n$, with $m_i > 1$, and $n_J$ is the number of integers
$i$ in $J$. If all $m_i$ are equal, then $\hat{\sigma}^2$ is just a within-groups mean-squared error from a
one-way analysis of variance. If $J$ is empty, then $\hat{\sigma}^2$ may be set to 0; however, such an
estimate is obviously not satisfactory. As the sample size $n$ becomes large, $\hat{\sigma}^2$ converges
with probability 1 to $\sigma^2$. Given $\hat{\sigma}^2$, $\tau_{rT0}^2$, which can be shown to be equal to $\tau_{r0}^2 - \sigma^2/m_H$,
can be estimated by

$$s_{rT0}^2 = s_{r0}^2 - \hat{\sigma}_e^2,$$

where

$$\hat{\sigma}_e^2 = \hat{\sigma}^2/\hat{m}_H$$

and $\hat{m}_H$ is the sample harmonic mean of the $m_i$, $1 \le i \le n$. One may also substitute $\hat{\tau}_{r0}^2$ for
$s_{r0}^2$. In like manner,

$$s_{rM0}^2 = s_{rT0}^2 + \hat{\sigma}^2/M$$

may be used to estimate $\tau_{rM0}^2$, the conditional mean of $r_0^2$ given $m_0 = M$.
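
The estimate of rater error variance and the resulting adjustment for true scores can be sketched as follows. The function names, the rating data, and the value passed for the PRESS mean square are all hypothetical; each inner list is assumed to hold the human scores available for one essay.

```python
def rater_error_variance(ratings):
    """Average within-essay variance over essays with more than one rating."""
    multiply_scored = [r for r in ratings if len(r) > 1]  # the set J
    if not multiply_scored:
        return 0.0  # J is empty: no information about rater error
    total = 0.0
    for r in multiply_scored:
        mean = sum(r) / len(r)
        total += sum((y - mean) ** 2 for y in r) / (len(r) - 1)
    return total / len(multiply_scored)


def harmonic_mean_raters(ratings):
    """Sample harmonic mean of the number of ratings per essay."""
    return len(ratings) / sum(1.0 / len(r) for r in ratings)


def true_score_press(s_r0_sq, ratings):
    """PRESS mean square minus the estimated rater error component."""
    sigma_e_sq = rater_error_variance(ratings) / harmonic_mean_raters(ratings)
    return s_r0_sq - sigma_e_sq


# Made-up ratings: two double-scored essays and one single-scored essay.
ratings = [[3, 4], [2, 2], [5]]
sigma_sq = rater_error_variance(ratings)
m_h = harmonic_mean_raters(ratings)
print(sigma_sq, m_h, true_score_press(0.40, ratings))
```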


3.5 Estimation of $\sigma_d^2$, $\sigma_{dT}^2$, and Inflations of Mean-Squared Error

As proved in the appendix, an estimate of $\sigma_d^2$ is

$$s_d^2 = (s_r^2 + s_{r0}^2)/2.$$

The more conventional estimate $ns_r^2/(n - q - 1)$ of $\sigma_d^2$ is not appropriate if the residual
$d_i$ and predictor $\mathbf{X}_i$ are not independent. In general, $n^{1/2}(s_d^2 - \sigma_d^2)$ has the same normal
approximation as $n^{1/2}(s_{r0}^2 - \tau_{r0}^2)$, and $n[E(s_d^2 \mid \mathbf{X}_i, 1 \le i \le n) - \sigma_d^2]$ converges to 0 with
probability 1. It then follows that the relative increase $I$ in mean-squared error can be
estimated by

$$\hat{I} = \frac{s_{r0}^2 - s_d^2}{s_d^2}.$$

It also follows that an estimate of the value of $I$ achieved if $n^*$ observations are used rather
than $n$ is $\hat{I}^* = n\hat{I}/n^*$. This estimate provides a guide to sample-size selection. Note that if
the standard regression assumptions hold, then it can be asserted without performing any
estimation that $nI$ will be close to $q + 1$. However, in a real application, where one rarely
knows whether the regression assumptions are true, it is recommended that $I$ be estimated
by $\hat{I}$ and that the estimates $\hat{I}$ and $\hat{I}^*$ guide the process of sample-size selection.

If $m_i > 1$ with positive probability, then, using Equation 2, $\sigma_{dT}^2$ may be estimated by

$$s_{dT}^2 = s_d^2 - \hat{\sigma}_e^2,$$

and $I_T$ may be estimated by

$$\hat{I}_T = \frac{s_{rT0}^2 - s_{dT}^2}{s_{dT}^2}.$$

Similar arguments apply to estimation of $I_M$. The estimate of $\sigma_{dM}^2$ is $s_{dM}^2 = s_{dT}^2 + \hat{\sigma}^2/M$, so
that $I_M$ has estimate

$$\hat{I}_M = \frac{s_{rM0}^2 - s_{dM}^2}{s_{dM}^2}.$$
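
A short sketch of the inflation estimate and its projection to other sample sizes follows. The inputs are the rounded values reported for the example prompt in Section 3.7 (residual mean square 0.3198, PRESS mean square 0.3364, n = 373); the function names and the alternative sample size of 746 are hypothetical, and results computed from rounded inputs can differ from the reported figures in the last digits.

```python
def inflation_estimates(s_r_sq, s_r0_sq):
    """Return the estimate of sigma_d^2 and the estimated inflation."""
    s_d_sq = (s_r_sq + s_r0_sq) / 2.0
    i_hat = (s_r0_sq - s_d_sq) / s_d_sq
    return s_d_sq, i_hat


def projected_inflation(i_hat, n, n_star):
    """Estimated inflation if n_star essays were used instead of n."""
    return n * i_hat / n_star


s_d_sq, i_hat = inflation_estimates(0.3198, 0.3364)
print(f"s_d^2 = {s_d_sq:.4f}, estimated inflation = {i_hat:.4f}")
print(f"projected inflation at n* = 746: {projected_inflation(i_hat, 373, 746):.4f}")
```

Doubling the sample size halves the projected inflation, which is the sense in which the projection guides sample-size selection.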

3.6 Estimation of Proportional Reduction in Mean-Squared Error Using Cross-Validation

Proportional reduction in mean-squared error may also be considered in terms of
cross-validation, for which results are especially simple when $\bar{Y}_0$ is approximated by the
sample mean $\bar{Y}$ of the $\bar{Y}_i$, $1 \le i \le n$. The error $r_{0C} = \bar{Y}_0 - \bar{Y}$ has mean 0 and variance

$$\tau_{r0C}^2 = \sigma_{\bar{Y}}^2(1 + n^{-1}).$$

Estimation can be accomplished by use of the conventional estimate

$$s_{\bar{Y}}^2 = (n - 1)^{-1}\sum_{i=1}^{n} (\bar{Y}_i - \bar{Y})^2$$

for $\sigma_{\bar{Y}}^2$, so that $\tau_{r0C}^2$ is estimated by

$$\hat{\tau}_{r0C}^2 = s_{\bar{Y}}^2(1 + n^{-1}).$$

Use of deleted residuals results in the estimate

$$s_{r0C}^2 = [n^2/(n^2 - 1)]\hat{\tau}_{r0C}^2.$$
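
The deleted-residual estimate for the constant predictor can be verified numerically: leaving essay i out of the sample mean and averaging the squared prediction errors gives the same value as the closed form. The sketch below uses made-up average scores and hypothetical function names.

```python
def baseline_press_by_deletion(scores):
    """Mean squared error of predicting each score by the mean of the others."""
    n = len(scores)
    total = sum(scores)
    deleted = [y - (total - y) / (n - 1) for y in scores]
    return sum(d * d for d in deleted) / n


def baseline_press_closed_form(scores):
    """[n^2 / (n^2 - 1)] times the sample variance times (1 + 1/n)."""
    n = len(scores)
    mean = sum(scores) / n
    sample_variance = sum((y - mean) ** 2 for y in scores) / (n - 1)
    tau_hat_sq = sample_variance * (1.0 + 1.0 / n)
    return n * n / (n * n - 1.0) * tau_hat_sq


# Made-up average human scores, for illustration only.
average_scores = [3.0, 4.5, 2.0, 5.0, 3.5]
by_deletion = baseline_press_by_deletion(average_scores)
closed_form = baseline_press_closed_form(average_scores)
print(by_deletion, closed_form)
```

Both routes reduce to the sample variance multiplied by n/(n - 1).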

Thus for estimation of $\bar{Y}_0$, the proportional reduction

$$\rho_{\bar{Y}0}^2 = \frac{\tau_{r0C}^2 - \tau_{r0}^2}{\tau_{r0C}^2}$$

in mean-squared error achieved by linear prediction of $\bar{Y}_0$ by $\hat{T}_0$ instead of by $\bar{Y}$ is estimated
by

$$\hat{\rho}_{\bar{Y}0}^2 = \frac{s_{r0C}^2 - s_{r0}^2}{s_{r0C}^2}.$$

As the sample size $n$ increases, $\hat{\rho}_{\bar{Y}0}^2$ converges with probability 1 to $\rho_{\bar{Y}}^2$, and $\rho_{\bar{Y}0}^2$ converges
to $\rho_{\bar{Y}}^2$. Similar approximations are available for $\rho_{T0}^2$ and $\rho_{M0}^2$. The mean square of
$r_{T0C} = T_0 - \bar{Y}$ is $\tau_{rT0C}^2 = \tau_{r0C}^2 - \sigma^2/m_H$, and the conditional variance of $r_{0C}$ given $m_0 = M$
is $\tau_{rM0C}^2 = \tau_{rT0C}^2 + \sigma^2/M$. The estimate of $\tau_{rT0C}^2$ is then $s_{rT0C}^2 = s_{r0C}^2 - \hat{\sigma}^2/\hat{m}_H$, and the
estimate of $\tau_{rM0C}^2$ is $s_{rM0C}^2 = s_{rT0C}^2 + \hat{\sigma}^2/M$. Hence,

$$\rho_{T0}^2 = \frac{\tau_{rT0C}^2 - \tau_{rT0}^2}{\tau_{rT0C}^2},$$

the proportional reduction in mean-squared error achieved by linear prediction of $T_0$ by $\hat{T}_0$
instead of by $\bar{Y}$, is estimated by

$$\hat{\rho}_{T0}^2 = \hat{\rho}_{\bar{Y}0}^2\,\frac{s_{r0C}^2}{s_{rT0C}^2},$$

and

$$\rho_{M0}^2 = \frac{\tau_{rM0C}^2 - \tau_{rM0}^2}{\tau_{rM0C}^2}$$

is estimated by

$$\hat{\rho}_{M0}^2 = \hat{\rho}_{\bar{Y}0}^2\,\frac{s_{r0C}^2}{s_{rM0C}^2}.$$

3.7 A Practitioner’s Guide: What Quantities To Examine in a Real Application?

Table 3 lists all the residuals, sums of squares, inflations of mean-squared error, and 
proportional reductions described above. An important question, given so many different 
quantities, is the following: Which of these quantities should we use and how in a real 
application? The answers are quite closely linked to answers found in regression analysis
and depend on the goal of the user of the methodology. We will discuss three potential
users and describe the quantities each would be interested in. Estimates and interpretations 
are considered for the first prompt in the second group to illustrate their application. 

User 1: One who wants an idea of the errors when two raters are used. The parameter
$\sigma_d^2$ measures the mean-squared error for prediction of an average holistic score by observed
essay features in the case in which the regression coefficients are known. Thus $\sigma_d^2$ is a
lower bound on the mean-squared error that can possibly be achieved when the regression
coefficients are estimated from a sample of essays. If $\sigma_d^2$ is regarded as too large for the
application, then no amount of sampling can lead to a satisfactory linear prediction of the
average holistic score. For the above-mentioned prompt, the estimate $s_d^2$ of $\sigma_d^2$ may be found
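
Table 3.
All Residuals, Sums of Squares, Variance Inflations, and Proportional Reductions

Quantity                Definition                                                  Notes
$d_i$                   $\bar{Y}_i - T_i^*$
$r_i$                   $\bar{Y}_i - \hat{T}_i$
$d_{Ti}$                $T_i - T_i^*$
$r_{Ti}$                $T_i - \hat{T}_i$
$r_{0C}$                $\bar{Y}_0 - \bar{Y}$
$r_{T0C}$               $T_0 - \bar{Y}$
$\sigma_d^2$            $E(d_i^2)$                                                  $\sigma_d^2 = \sigma_{dT}^2 + \sigma^2/m_H$
$\sigma_{dT}^2$         $E(d_{Ti}^2)$
$\sigma_{dM}^2$         $E(d_i^2 \mid m_i = M)$                                     $\sigma_{dM}^2 = \sigma_{dT}^2 + \sigma^2/M$
$\sigma_{\bar{Y}}^2$    $E([\bar{Y}_i - E\bar{Y}_i]^2)$                             $\sigma_{\bar{Y}}^2 = \sigma_T^2 + \sigma^2/m_H$
$\sigma_{\bar{Y}M}^2$   $E([\bar{Y}_i - E\bar{Y}_i]^2 \mid m_i = M)$                $\sigma_{\bar{Y}M}^2 = \sigma_T^2 + \sigma^2/M$
$\tau_{r0}^2$           $E(r_0^2 \mid \mathbf{X}_i, 1 \le i \le n)$
$\tau_{rT0}^2$          $E(r_{T0}^2 \mid \mathbf{X}_i, 1 \le i \le n)$
$\tau_{rM0}^2$          $E(r_0^2 \mid m_0 = M, \mathbf{X}_i, 1 \le i \le n)$        $\tau_{rM0}^2 = \tau_{rT0}^2 + \sigma^2/M$
$\tau_{r0C}^2$          $E(r_{0C}^2)$                                               $\tau_{r0C}^2 = \sigma_{\bar{Y}}^2(1 + 1/n)$
$\tau_{rT0C}^2$         $E(r_{T0C}^2)$                                              $\tau_{rT0C}^2 = \tau_{r0C}^2 - \sigma^2/m_H$
$\tau_{rM0C}^2$         $E(r_{0C}^2 \mid m_0 = M, \mathbf{X}_i, 1 \le i \le n)$     $\tau_{rM0C}^2 = \tau_{rT0C}^2 + \sigma^2/M$
$I$                     $\tau_{r0}^2/\sigma_d^2 - 1$                                RI: $\bar{Y}_0$, $\hat{T}_0$, $T_0^*$
$I_T$                   $\tau_{rT0}^2/\sigma_{dT}^2 - 1$                            RI: $T_0$, $\hat{T}_0$, $T_0^*$
$I_M$                   $\tau_{rM0}^2/\sigma_{dM}^2 - 1$                            Given $m_0 = M$, RI: $\bar{Y}_0$, $\hat{T}_0$, $T_0^*$
$\rho_{\bar{Y}}^2$      $(\sigma_{\bar{Y}}^2 - \sigma_d^2)/\sigma_{\bar{Y}}^2$      PRMSE: $\bar{Y}_i$, $T_i^*$, $E(\bar{Y}_i)$
$\rho_T^2$              $(\sigma_T^2 - \sigma_{dT}^2)/\sigma_T^2$                   PRMSE: $T_i$, $T_i^*$, $E(T_i)$
$\rho_{\bar{Y}M}^2$     $\rho_{\bar{Y}}^2\sigma_{\bar{Y}}^2/\sigma_{\bar{Y}M}^2$    Given $m_0 = M$, PRMSE: $\bar{Y}_i$, $T_i^*$, $E(\bar{Y}_i)$
$\rho_{\bar{Y}0}^2$     $(\tau_{r0C}^2 - \tau_{r0}^2)/\tau_{r0C}^2$                 PRMSE: $\bar{Y}_0$, $\hat{T}_0$, $\bar{Y}$
$\rho_{T0}^2$           $(\tau_{rT0C}^2 - \tau_{rT0}^2)/\tau_{rT0C}^2$              PRMSE: $T_0$, $\hat{T}_0$, $\bar{Y}$
$\rho_{M0}^2$           $(\tau_{rM0C}^2 - \tau_{rM0}^2)/\tau_{rM0C}^2$              Given $m_0 = M$, PRMSE: $\bar{Y}_0$, $\hat{T}_0$, $\bar{Y}$

Note. “PRMSE: a, b, c” means the proportional reduction in mean-squared error by prediction
of a by b compared to prediction of a by c. “RI: d, e, f” means the relative increase in
mean-squared error, conditional on $\mathbf{X}_i$, $1 \le i \le n$, due to estimation of d by e compared to
estimation of d by f.
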
in the following fashion from regression output from SAS. The estimated root mean-squared
error is

$$\left[(n - q - 1)^{-1}\sum_{i=1}^{n} r_i^2\right]^{1/2} = 0.5725,$$

where $n$ is 373 and $q$ is 8. It follows that

$$s_r^2 = n^{-1}\sum_{i=1}^{n} r_i^2 = \frac{n - q - 1}{n}(0.5725)^2 = 0.3198.$$

The PRESS sum of squares is

$$\sum_{i=1}^{n} d_{(i)}^2 = 125.5,$$

so that

$$s_{r0}^2 = n^{-1}\sum_{i=1}^{n} d_{(i)}^2 = 0.3364.$$

It follows that

$$s_d^2 = (s_r^2 + s_{r0}^2)/2 = (0.3198 + 0.3364)/2 = 0.3281.$$

By itself, this estimate does not suggest a precise approximation to the average human
score, for the square root $s_d$ of $s_d^2$ is 0.5728, and the underlying measurements are on a
6-point scale. Nonetheless, alternatives must be considered to provide a proper perspective.

The simplest alternative measure is $\sigma_{\bar{Y}}^2$, the mean-squared error for prediction of
the average holistic score when the mean holistic score is known and no essay features
are employed in the prediction. The estimate $s_{\bar{Y}}^2$ is the square 0.9496 of the sample
standard deviation 0.9745 of the $\bar{Y}_i$ that is reported by SAS. Thus the estimate $s_d^2$ of the
mean-squared error $\sigma_d^2$ is somewhat smaller than is the estimate $s_{\bar{Y}}^2$ associated with a trivial
constant predictor. The coefficient $\rho_{\bar{Y}}^2$ then is the proportional reduction in mean-squared
error achieved by prediction of the average holistic score by use of a regression on essay
features in which all population means, variances, and covariances are known. The estimate
of $\rho_{\bar{Y}}^2$ is

$$1 - s_d^2/s_{\bar{Y}}^2 = 1 - 0.3281/0.9496 = 0.6545,$$

so that the regression analysis is predicted to be much more effective than use of a constant
predictor.
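
The conversion from regression output to the estimates above can be retraced in a few lines of Python. The inputs are the rounded figures quoted in the text (root mean-squared error 0.5725, PRESS 125.5, standard deviation 0.9745, n = 373, q = 8); the variable names are hypothetical, and because the inputs are rounded, the results agree with the reported values only to about four decimal places.

```python
n, q = 373, 8
root_mse = 0.5725   # root mean-squared error from the regression output
press = 125.5       # PRESS sum of squares
sd_scores = 0.9745  # sample standard deviation of the average human scores

# Rescale the conventional error variance (divisor n - q - 1) to divisor n.
s_r_sq = (n - q - 1) / n * root_mse ** 2
# PRESS mean square.
s_r0_sq = press / n
# Estimate of the mean-squared error with known regression coefficients.
s_d_sq = (s_r_sq + s_r0_sq) / 2.0
# Variance estimate for the trivial constant predictor.
s_ybar_sq = sd_scores ** 2
# Estimated proportional reduction in mean-squared error.
rho_sq = 1.0 - s_d_sq / s_ybar_sq

print(f"{s_r_sq:.4f} {s_r0_sq:.4f} {s_d_sq:.4f} {s_ybar_sq:.4f} {rho_sq:.4f}")
```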
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In practice, using a sample of essays to estimate population parameters leads to less 
satisfactory results than are obtained if population parameters are known. Thus t 2 0 is the 
mean-squared error, conditional on the observed essay features in the sample, achieved 
when predicting an average holistic score for an essay not in the sample by use of essay 
features with regression parameters estimated by the observed sampling data. If data 
suggest that t 2 0 is excessive for the application but a d is acceptable, then using a larger 
sample is appropriate. The absolute loss in mean-squared error due to sampling is r 2 0 — a d . 
If the estimated value of cr 2 is acceptably small, then the sample size required for r 2 0 
to have an estimate that is acceptably small can be estimated. In the example, r 2 0 is 
estimated by s 2 0 = 0.3364. As expected, s 2 0 is larger than the estimated mean-squared error 
s 2 d = 0.3281. The loss of precision due to estimation of regression coefficients is quantified 
in the coefficient /, the relative loss in mean-squared error due to sampling. The estimated 
value of / is 

I = (4) - s 2 d )/(s 2 d ) = (0.3364 - 0.3281)/0.3281 = 0.02518, 

so that the relative increase in mean-squared error is about 2.5%. This estimated relative 
increase is not surprising, for (q + 1 )/n = (8 + l)/373 = 0.02413. Such an increase might 
well be considered acceptable. The absolute increase 0.3364 — 0.3281 = 0.008262 also 
appears acceptably small. 
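The quantities used above can be computed directly from an ordinary least-squares fit. The following is a minimal sketch, not the authors' SAS procedure: it assumes the deleted residuals are obtained from the standard leverage identity (residual divided by one minus leverage), and the function name `press_summary` and the synthetic data are purely illustrative.

```python
import numpy as np

def press_summary(X, y):
    """Return s_r^2, s_r0^2 (PRESS / n), s_d^2 and the estimated inflation I.

    X is an (n, q) feature matrix without an intercept column; y has length n.
    """
    n = len(y)
    Z = np.column_stack([np.ones(n), X])            # design matrix with intercept
    beta, *_ = np.linalg.lstsq(Z, y, rcond=None)
    resid = y - Z @ beta                            # ordinary residuals
    # Diagonal of the hat matrix Z pinv(Z), i.e. the leverages
    leverage = np.einsum("ij,ji->i", Z, np.linalg.pinv(Z))
    deleted = resid / (1.0 - leverage)              # deleted (PRESS) residuals
    s_r2 = np.mean(resid ** 2)
    s_r02 = np.mean(deleted ** 2)
    s_d2 = 0.5 * (s_r2 + s_r02)
    return {"s_r2": s_r2, "s_r02": s_r02, "s_d2": s_d2,
            "I": (s_r02 - s_d2) / s_d2, "deleted": deleted}

# Synthetic data (illustrative only): 373 essays, 8 features
rng = np.random.default_rng(0)
X = rng.normal(size=(373, 8))
y = 3.5 + X @ (0.3 * rng.normal(size=8)) + rng.normal(scale=0.6, size=373)
out = press_summary(X, y)
print({k: round(float(v), 4) for k, v in out.items() if k != "deleted"})
```

With 8 predictors and 373 observations, the estimated inflation from such a fit should typically be of the same order as (q + 1)/n.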

Given that a sample of essays is used to estimate prediction parameters, an added
measure of interest is $\rho_{\bar Y0}^2$, the proportional reduction in mean-squared error achieved by
prediction of average holistic scores by essay features in a sample of essays. The coefficient
$\rho_{\bar Y0}^2$ is normally less than $\rho_{\bar Y}^2$, and the difference between the two coefficients provides an
added measure of the loss of predictive power due to the effects of parameter estimation
from a sample. For the example, $\rho_{\bar Y0}^2$ is estimated by

$$\hat\rho_{\bar Y0}^2 = 1 - s_{r0}^2/s_{r0C}^2,$$

where

$$s_{r0C}^2 = s_{\bar Y}^2\,\frac{n}{n-1} = 0.9496(373/372) = 0.9522.$$

Thus

$$\hat\rho_{\bar Y0}^2 = 1 - 0.3364/0.9522 = 0.6467,$$

and the proportional reduction of mean-squared error is about two-thirds. Comparison to
the estimate of $\rho_{\bar Y}^2$ suggests that the loss due to the estimation of regression coefficients and
means is relatively small.

User 2: One who wants to know what proportion of the error in
estimation can be attributed to the raters. This user will be interested in the results that can
be achieved if rater error disappears. This analysis permits the practitioner to distinguish
between prediction error that is inevitable given rater error and prediction error that results
from an imperfect relationship between the true essay score and the essay features. Thus
$\sigma_{dT}^2$ measures the prediction error of the true essay score by the essay features when all
needed means, variances, and covariances are known, and $\tau_{rT0}^2$ is the corresponding measure
when regression parameters are estimated from the sample and the essay under study is not
in the sample. For estimation of these measures, one uses the estimated rater variance

$$\hat\sigma_e^2 = n^{-1}\sum_{i=1}^{n}\sum_{j=1}^{2}(Y_{ij} - \bar Y_i)^2.$$

Here formulas simplify because each essay has two raters, so that each $m_i = 2$, $J$ is the
set of integers from 1 to $n = 373$, and $n_J = n$. The estimate $\hat\sigma_e^2 = 0.1260$, so that $\hat\sigma_{\bar e}^2$, the
estimated variance of the average rater error $\bar e_i$ for essay $i$, is 0.1260/2 = 0.0630. It follows
that the mean-squared error $\sigma_{dT}^2$ has the estimate

$$s_{dT}^2 = s_d^2 - \hat\sigma_{\bar e}^2 = 0.3281 - 0.0630 = 0.2651.$$

Thus a substantial fraction (0.0630/0.3281 = 0.192) of the estimated mean-squared error
$s_d^2$ for prediction of the average holistic score $\bar Y_i$ is due to rater variability. In like fashion,
$\tau_{rT0}^2$ has the estimate

$$s_{rT0}^2 = s_{r0}^2 - \hat\sigma_{\bar e}^2 = 0.3364 - 0.0630 = 0.2734.$$
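As a rough illustration of the rater-variance step, the sketch below assumes the estimate is the average, over essays, of the usual unbiased within-essay variance of the ratings; this is an assumption about the estimator's form, and the function name `rater_variance` and the four made-up pairs of ratings are hypothetical.

```python
import numpy as np

def rater_variance(scores):
    """Average within-essay sample variance for an (n, m) array of ratings."""
    scores = np.asarray(scores, dtype=float)
    within = scores.var(axis=1, ddof=1)   # unbiased variance across the m raters
    return float(within.mean())

# Made-up ratings for four essays, two raters each (illustration only)
ratings = np.array([[4, 4], [3, 4], [5, 4], [2, 2]])
sigma_e2 = rater_variance(ratings)              # rater variance estimate
sigma_ebar2 = sigma_e2 / ratings.shape[1]       # variance of the average rater error
print(sigma_e2, sigma_ebar2)
```

With two raters per essay, the variance of the average rater error is half the rater variance, which is the step that takes 0.1260 to 0.0630 in the example.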

No matter how large the sample may be, $\tau_{rT0}^2$ cannot be less than $\sigma_{dT}^2$. The difference
between $\tau_{rT0}^2$ and $\sigma_{dT}^2$ is the same as the difference between $\tau_{r0}^2$ and $\sigma_d^2$; however, $\sigma_{dT}^2$ is
less than $\sigma_d^2$, so that the proportional increase $I_T$ in mean-squared error for predicting the
true essay score if sampling is used to estimate regression parameters is greater than the
corresponding proportional increase $I$ for prediction of the average holistic score. In the
example, $I_T$ has estimate

$$\hat I_T = (0.2734 - 0.2651)/0.2651 = 0.03117,$$

so that the inflation of mean-squared error of about 3% is somewhat larger for predicting
true holistic scores than was the case for predicting average holistic scores.

The proportional reduction $\rho_T^2$ in mean-squared error for predicting true essay score
by essay features when all means, covariances, and variances are known is larger than the
corresponding proportional reduction $\rho_{\bar Y}^2$ in mean-squared error for predicting average
essay scores by essay features. When sampling is required, the proportional reduction $\rho_{T0}^2$
in mean-squared error for predicting true essay scores by essay features is normally smaller
than $\rho_T^2$. A striking aspect of automated essay scoring is that $\rho_T^2$ and $\rho_{T0}^2$ can be quite high,
say 0.9. In the example, results are less striking. With use of a constant predictor, the
estimated mean-squared error for prediction of $T_0$ is

$$s_{rT0C}^2 = s_{r0C}^2 - \hat\sigma_{\bar e}^2 = 0.9522 - 0.0630 = 0.8892,$$

so that the estimated proportional reduction in mean-squared error is

$$\hat\rho_{T0}^2 = (0.8892 - 0.2734)/0.8892 = 0.6926.$$

Thus the regression analysis has accounted for about 70% of the mean-squared error that is
not due to rater variability.

User 3: One who wants an idea of the errors when one rater
is employed from the data based on two raters per essay. In some cases, a testing program
may consider using an automated score in place of some fixed number $M$ of human ratings
of an essay. Coefficients $\sigma_{dM}^2$, $\tau_{rM0}^2$, $I_M$, $\rho_{\bar YM}^2$, and $\rho_{M0}^2$ are provided for this case. They
are interpreted as in the case of ordinary prediction of the average holistic score given the
added condition that the number of raters is specified to be $M$. Illustrations in this report
use $M = 1$. For the example, the estimated mean-squared error $s_{dM}^2$ for known population
characteristics is

$$s_{dM}^2 = s_{dT}^2 + \hat\sigma_e^2 = 0.2651 + 0.1260 = 0.3911,$$

while the estimated cross-validation mean-squared error derived from deleted residuals is

$$s_{rM0}^2 = s_{rT0}^2 + \hat\sigma_e^2 = 0.2734 + 0.1260 = 0.3994.$$

Note that, due to use of only one rater rather than two, the predictions of average holistic
scores are somewhat less accurate here than in the original case of two raters. The
corresponding estimated cross-validation mean-squared error with a constant predictor is

$$s_{rM0C}^2 = s_{rT0C}^2 + \hat\sigma_e^2 = 0.8892 + 0.1260 = 1.015,$$

so that the estimated proportional reduction in mean-squared error is

$$\hat\rho_{M0}^2 = (1.015 - 0.3994)/1.015 = 0.6066.$$

The smaller proportional reduction in mean-squared error for one rather than two raters is
predictable. It is also predictable that inflation of mean-squared error is reduced compared
to $I$ and $I_T$ for this case. The inflation $I_M$ is estimated by

$$\hat I_M = (0.3994 - 0.3911)/0.3911 = 0.02113.$$

Thus the inflation of mean-squared error for one rater is about 2% rather than the
approximate 2.5% achieved for two raters.
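The arithmetic for the three users can be collected in one place. The sketch below is only a bookkeeping aid under the formulas stated in this guide; the function name `user_summaries` and its argument names are hypothetical, and because it starts from the rounded inputs printed in the text, its results agree with the reported values only to about three decimal places.

```python
def user_summaries(s_r2, s_r02, s_ybar2, n, sigma_e2, m=2, M=1):
    """Reproduce the User 1-3 summaries from rounded summary statistics."""
    s_d2 = (s_r2 + s_r02) / 2
    s_r0c2 = s_ybar2 * n / (n - 1)        # cross-validation MSE, constant predictor
    sigma_ebar2 = sigma_e2 / m            # variance of the average rater error
    out = {"s_d2": s_d2}
    # User 1: prediction of the average holistic score
    out["I"] = (s_r02 - s_d2) / s_d2
    out["rho_Y0"] = 1 - s_r02 / s_r0c2
    # User 2: prediction of the true essay score
    s_dT2 = s_d2 - sigma_ebar2
    s_rT02 = s_r02 - sigma_ebar2
    s_rT0c2 = s_r0c2 - sigma_ebar2
    out["I_T"] = (s_rT02 - s_dT2) / s_dT2
    out["rho_T0"] = (s_rT0c2 - s_rT02) / s_rT0c2
    # User 3: prediction of the average of M ratings
    s_dM2 = s_dT2 + sigma_e2 / M
    s_rM02 = s_rT02 + sigma_e2 / M
    s_rM0c2 = s_rT0c2 + sigma_e2 / M
    out["I_M"] = (s_rM02 - s_dM2) / s_dM2
    out["rho_M0"] = (s_rM0c2 - s_rM02) / s_rM0c2
    return out

res = user_summaries(s_r2=0.3198, s_r02=0.3364, s_ybar2=0.9496, n=373, sigma_e2=0.1260)
for name, value in res.items():
    print(f"{name}: {value:.4f}")
```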

3.8 Results From Analysis of the Data 

To begin, each essay in the four groups of prompts was analyzed. A summary of results
is reported in Table 4. Note that results for Group 1 are omitted for entries that rely on
$\hat\sigma_e^2$ due to the very limited number of essays that have been double-scored. The case of
$M = 1$ is considered in the table. Several basic conclusions appear possible, at least for
these groups of prompts. In typical cases, the means of the estimates $n\hat I$ of inflation of
mean-squared error are comparable to the ideal value of 9 associated with an intercept
and 8 predictors (i.e., the inflation of mean-squared error due to estimation is roughly
similar to the amount anticipated if the standard assumptions of regression are valid).
Typical estimated inflation $\hat I$ of mean-squared error is about 2%, a modest value. Even if
typical sample sizes were halved to around 250, the inflation would be doubled to around
4%, a value that could be regarded as tolerable. The available estimates of proportional
reductions in mean-squared error, $\hat\rho_{\bar Y0}^2$, $\hat\rho_{T0}^2$, and $\hat\rho_{10}^2$, indicate that families of prompts vary
quite substantially as to how well e-rater predicts human scores, with the best results for
the second and third groups of prompts.

Table 4.
Summary of Regression Analysis of Individual Essay Prompts Within Groups

Statistic               Group 1           Group 2           Group 3           Group 4
                        Mean     S.D.     Mean     S.D.     Mean     S.D.     Mean     S.D.
$n$                     495.9    6.5      456.0    62.0     467.0    14.4     499.8    6.2
$s_{r0}^2$              0.352    0.061    0.367    0.114    0.289    0.053    0.496    0.201
$s_{r0C}^2$             0.871    0.084    1.234    0.255    1.688    0.322    1.215    0.297
$\hat I$                0.0198   0.0019   0.0225   0.0049   0.0234   0.0028   0.0203   0.0022
$n\hat I$               9.81     0.91     10.05    1.29     10.94    1.18     10.18    1.14
$\hat\rho_{\bar Y0}^2$  0.593    0.075    0.687    0.131    0.826    0.027    0.602    0.092
$s_{rT0}^2$                               0.309    0.121    0.173    0.052    0.361    0.204
$s_{rT0C}^2$                              1.175    0.261    1.572    0.323    1.080    0.301
$\hat I_T$                                0.0286   0.0110   0.0415   0.0070   0.0319   0.0081
$n\hat I_T$                               12.69    3.88     19.42    3.69     15.96    4.06
$\hat\rho_{T0}^2$                         0.723    0.137    0.888    0.021    0.686    0.119
$s_{r10}^2$                               0.426    0.112    0.404    0.058    0.632    0.198
$s_{r10C}^2$                              1.292    0.251    1.803    0.322    1.351    0.293
$\hat I_1$                                0.0190   0.0033   0.0166   0.0026   0.0154   0.0023
$n\hat I_1$                               12.69    3.88     7.74     1.10     7.72     1.16
$\hat\rho_{10}^2$                         0.655    0.128    0.771    0.039    0.537    0.078
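The halving argument in the text rests on the inflation being roughly inversely proportional to sample size. A minimal sketch of that projection follows; the function names are hypothetical, and the rule assumes the product of sample size and inflation stays approximately constant as the sample size changes.

```python
import math

def projected_inflation(n, inflation, n_star):
    """Inflation expected with n_star essays, given inflation estimated from n essays."""
    return n * inflation / n_star

def required_sample_size(n, inflation, target):
    """Smallest sample size whose projected inflation does not exceed target."""
    return math.ceil(n * inflation / target)

# About 2% inflation at roughly 500 essays, projected to 250 essays
print(projected_inflation(500, 0.02, 250))
# Essays needed to keep the projected inflation at or below 3%
print(required_sample_size(500, 0.02, 0.03))
```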

3.9 Combining Essays in Groups 

An alternative approach summarizes the data by looking at the prediction of a score
for a group of essays. The initial approach is to use distinct regression coefficients for each
prompt, so that $K$ prompts in effect have $K(q + 1)$ predictors. As evident from Table 5,
results for this approach are quite similar to those for Table 4, except that the proportional
reduction in mean-squared error is increased slightly. This is because it is computed
relative to a constant predictor for all essays in the entire family rather than relative to a
constant predictor for each prompt. The number of prompts scored in Group 1 is sufficient
for analysis related to true scores; however, results for true scores and for exactly one
score should be approached with caution given that the sampling assumptions appear
questionable. The increase in the estimated product of sample size by relative inflation of
mean-squared error primarily reflects the increased number of predictors present in the
analysis.

Table 5.
Summary of Regression Analysis of Essay Prompts Within Groups: Distinct
Coefficients for Each Essay

Statistic               Group 1     Group 2     Group 3     Group 4
$n$                     36,697      6,384       12,143      7,997
$s_{r0}^2$              0.352       0.366       0.288       0.496
$s_{r0C}^2$             0.892       1.320       1.698       1.355
$\hat I$                0.0198      0.0217      0.0235      0.0202
$n\hat I$               727.62      138.68      285.80      161.45
$\hat\rho_{\bar Y0}^2$  0.605       0.723       0.830       0.634
$s_{rT0}^2$             0.128       0.307       0.172       0.360
$s_{rT0C}^2$            0.668       1.262       1.582       1.219
$\hat I_T$              0.0568      0.0259      0.0399      0.0280
$n\hat I_T$             2,083.56    165.54      484.46      224.04
$\hat\rho_{T0}^2$       0.809       0.756       0.891       0.705
$s_{r10}^2$             0.366       0.424       0.403       0.632
$s_{r10C}^2$            0.906       1.378       1.813       1.491
$\hat I_1$              0.0191      0.0187      0.0167      0.0158
$n\hat I_1$             700.56      119.35      202.68      126.20
$\hat\rho_{10}^2$       0.596       0.693       0.777       0.576

Another approach of interest involves a much smaller number of predictors. A linear
model is used for each group in which a separate intercept is used for each prompt, but
the regression coefficients for the predictors are the same for each prompt. Results are
summarized in Table 6. Relative to use of distinct intercepts and regression slopes for each
prompt, estimated losses in mean-squared error are very limited (Groups 1, 2, and 4) or
nonexistent (Group 3). This approach has much more modest sample-size requirements
than does the approach with individual regression coefficients for each prompt. If the group
contains a substantial number of prompts, then a tolerable inflation of mean-squared error is
obtained with about a tenth of the essays required with individual regression coefficients for
each prompt. If the standard regression model applies, then the inflation of mean-squared
error is approximately the number of prompts plus the number of predictors divided by the
group sample size. In typical applications, the inflation will be approximated by one over
the number of essays per prompt.

Table 6.
Summary of Regression Analysis of Essay Prompts Within Groups: Distinct
Intercepts for Each Essay, Common Regression Slopes

Statistic               Group 1     Group 2     Group 3     Group 4
$n$                     36,697      6,384       12,143      7,997
$s_{r0}^2$              0.360       0.380       0.285       0.503
$s_{r0C}^2$             0.892       1.320       1.698       1.355
$\hat I$                0.0023      0.0036      0.0030      0.0032
$n\hat I$               82.83       22.80       36.03       25.25
$\hat\rho_{\bar Y0}^2$  0.597       0.712       0.832       0.629
$s_{rT0}^2$             0.135       0.322       0.169       0.368
$s_{rT0C}^2$            0.668       1.262       1.582       1.219
$\hat I_T$              0.0060      0.0042      0.0050      0.0043
$n\hat I_T$             221.50      26.93       60.68       34.62
$\hat\rho_{T0}^2$       0.798       0.745       0.893       0.699
$s_{r10}^2$             0.373       0.4384      0.400       0.639
$s_{r10C}^2$            0.906       1.378       1.813       1.491
$\hat I_1$              0.0022      0.0031      0.0021      0.0025
$n\hat I_1$             79.86       19.77       25.63       19.87
$\hat\rho_{10}^2$       0.588       0.682       0.779       0.571

An even simpler model for a group of essays ignores the prompt entirely, so that the
same intercepts and the same regression coefficients are applied to each essay in the group.
Results are summarized in Table 7. Although the inflations of mean-squared error are very
small, there is a substantial increase in the actual mean-squared error in Group 1 and in
Group 2. Losses in mean-squared error are also encountered in the other groups, but they
are very small in Group 3 and modest in Group 4. The one virtue of the approach with
a common regression equation for each prompt is that the sample-size requirements for
the group are similar to those for a single prompt. Thus one could consider use of several
hundred essays for a complete group.

Table 7.
Summary of Regression Analysis of Essay Prompts Within Groups:
Common Intercepts and Common Regression Slopes

Statistic               Group 1     Group 2     Group 3     Group 4
$n$                     36,697      6,384       12,143      7,997
$s_{r0}^2$              0.398       0.441       0.303       0.558
$s_{r0C}^2$             0.892       1.320       1.698       1.355
$\hat I$                0.0003      0.0015      0.0009      0.0013
$n\hat I$               9.36        9.44        10.78       10.10
$\hat\rho_{\bar Y0}^2$  0.554       0.666       0.829       0.608
$s_{rT0}^2$             0.173       0.383       0.175       0.396
$s_{rT0C}^2$            0.668       1.262       1.582       1.219
$\hat I_T$              0.0006      0.0017      0.0015      0.0017
$n\hat I_T$             21.50       10.87       17.89       13.56
$\hat\rho_{T0}^2$       0.740       0.697       0.889       0.675
$s_{r10}^2$             0.412       0.499       0.406       0.668
$s_{r10C}^2$            0.906       1.378       1.813       1.491
$\hat I_1$              0.0002      0.0013      0.0006      0.0010
$n\hat I_1$             9.05        8.34        7.72        8.04
$\hat\rho_{10}^2$       0.546       0.638       0.776       0.552
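The three grouped models differ mainly in how many parameters must be estimated, which under standard regression assumptions drives the inflation of mean-squared error. The sketch below compares the rough rule "number of parameters divided by group sample size" across the three designs. The value K = 14 prompts for Group 2 is an assumption inferred here by dividing the group total (6,384) by the mean prompt sample size (456.0); it is not stated in the text, and the function names are hypothetical.

```python
def parameter_counts(K, q):
    """Number of estimated regression parameters under each grouped design."""
    return {
        "distinct coefficients": K * (q + 1),  # separate intercept and slopes per prompt
        "common slopes": K + q,                # separate intercepts, shared slopes
        "common equation": q + 1,              # one intercept, shared slopes
    }

def approx_inflation(n_params, n):
    """Approximate relative inflation of mean-squared error, parameters / n."""
    return n_params / n

# Group 2: n = 6,384 essays, q = 8 features, K = 14 prompts (inferred, see lead-in)
counts = parameter_counts(K=14, q=8)
inflations = {name: approx_inflation(p, 6384) for name, p in counts.items()}
for name in counts:
    print(f"{name}: {counts[name]} parameters, inflation about {inflations[name]:.4f}")
```

For comparison, the estimated inflations reported for Group 2 in Tables 5, 6, and 7 are 0.0217, 0.0036, and 0.0015, respectively.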

4. Sample-Size Determination for a Cumulative Logit Model 

An alternative to linear regression analysis is cumulative logit analysis (Bock, 1973;
Feng, Dorans, Patsula, & Kaplan, 2003; Haberman, 2006; McCullagh & Nelder, 1989;
Pratt, 1981). This alternative has the advantage that approximations of $\bar Y_i$ and $T_i$ must be
within the range of possible essay scores. A mild disadvantage is that cross-validation is
more difficult to perform with commonly available software.

4.1 The Statistical Model

The cumulative logit model assumes that, for each essay $i$, conditional on $X_i$ and $m_i$,
the scores $Y_{ij}$, $1 \le j \le m_i$, are independent random variables. For unknown parameters $\eta_g$,
$1 \le g \le G-1$, and $\gamma_k$, $1 \le k \le q$, the conditional probability $P_{ig}$ that $Y_{ij} \le g$, $1 \le g \le G-1$,
given $X_i$, satisfies the cumulative logit relationship

$$\lambda_{ig} = \log[P_{ig}/(1 - P_{ig})] = \eta_g + \sum_{k=1}^{q} \gamma_k X_{ik}.$$

The method of analysis in this section does not assume that the probability model is true,
just as the regression analysis in the previous section did not assume that the standard
assumptions of linear regression were valid. Let $P_{iG} = 1$ and $P_{i0} = 0$. The $\eta_g$ and $\gamma_k$
are unique parameters defined to minimize the expected value of the average logarithmic
penalty $L_i$, where $L_i$ is the average of $L_{ij}$, $1 \le j \le m_i$, and

$$L_{ij} = -\log[P_{ig} - P_{i(g-1)}]$$

if $Y_{ij} = g$ (Haberman, 1989; Gilula & Haberman, 1994). The cumulative logit model
is evaluated by considering the mean-squared error from approximation of $\bar Y_i$ by its
corresponding approximated expected value

$$T_{iL}^* = \sum_{g=1}^{G} g[P_{ig} - P_{i(g-1)}] = 1 + \sum_{g=1}^{G-1}(1 - P_{ig})$$

given $X_i$. Maximum likelihood may be applied to the $X_i$ and $Y_{ij}$, $1 \le j \le m_i$, for $1 \le i \le n$
to obtain estimates $\hat\eta_g$ of $\eta_g$ and $\hat\gamma_k$ of $\gamma_k$. If maximum-likelihood estimates exist and the
sample covariance matrix of the $X_i$, $1 \le i \le n$, is nonsingular, then they are uniquely
defined. Common statistical packages such as SAS may be employed for this purpose.
Given the parameter estimates, one may estimate $\lambda_{ig}$ by

$$\hat\lambda_{ig} = \hat\eta_g + \sum_{k=1}^{q} \hat\gamma_k X_{ik}$$

and $P_{ig}$ by

$$\hat P_{ig} = [1 + \exp(-\hat\lambda_{ig})]^{-1}.$$

The true score $T_i$ and the mean $\bar Y_i$ are both estimated by

$$T_{iL} = \sum_{g=1}^{G} g[\hat P_{ig} - \hat P_{i(g-1)}] = 1 + \sum_{g=1}^{G-1}(1 - \hat P_{ig}).$$

The $\eta_g$ and $\gamma_k$ are uniquely defined as long as the conditional probabilities $P_{ig}$ are strictly
increasing in $g$ for fixed $i$ and the covariance matrix of $X_i$ is positive definite.
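The step from cumulative logit parameters to a predicted score can be sketched as follows. This is a minimal illustration rather than the SAS computation used in the report: the parameter values, the feature vector, and the function names are all hypothetical, and the cut-point parameters are assumed to be increasing so that the cumulative probabilities increase with the score category.

```python
import numpy as np

def cumulative_probs(eta, gamma, x):
    """P(Y <= g) for g = 1..G-1 under the cumulative logit relationship."""
    lam = np.asarray(eta, dtype=float) + float(np.dot(gamma, x))
    return 1.0 / (1.0 + np.exp(-lam))

def category_probs(eta, gamma, x):
    """P(Y = g) for g = 1..G, using P_0 = 0 and P_G = 1."""
    p = np.concatenate([[0.0], cumulative_probs(eta, gamma, x), [1.0]])
    return np.diff(p)

def expected_score(eta, gamma, x):
    """Approximated expected score, 1 + sum over g < G of (1 - P_g)."""
    return 1.0 + float(np.sum(1.0 - cumulative_probs(eta, gamma, x)))

# Hypothetical parameters for a 6-point scale (G = 6) with two features
eta = [-3.0, -1.5, 0.0, 1.5, 3.0]
gamma = [-0.8, -0.5]
x = [0.4, -1.0]
print(category_probs(eta, gamma, x))
print(expected_score(eta, gamma, x))
```

Because the predicted score is a probability-weighted average of the categories 1 through G, it cannot fall outside the score range, which is the advantage over linear regression noted at the start of this section.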

An analysis quite similar to that for linear prediction may be considered. The major
change involves cross-validation. To be consistent with the error criterion used with least
squares, consider the mean-squared error

$$\sigma_{dL}^2 = E(d_{iL}^2),$$

where

$$d_{iL} = \bar Y_i - T_{iL}^*.$$

The residual sum of squares is

$$S_{rL} = \sum_{i=1}^{n} r_{iL}^2$$

for

$$r_{iL} = \bar Y_i - T_{iL}.$$

The mean-squared error of $d_{TiL} = T_i - T_{iL}^*$ is $\sigma_{dTL}^2 = E([d_{TiL}]^2)$, and

$$\sigma_{dL}^2 = \sigma_{dTL}^2 + \sigma_e^2/m_H. \qquad (8)$$

Conditional on the use of $M$ raters, so that $m_i = M > 0$, the conditional mean-squared
error of $T_{iL}^*$ as a predictor of $\bar Y_i$ is

$$\sigma_{dML}^2 = E(d_{iL}^2 \mid m_i = M) = \sigma_{dTL}^2 + \sigma_e^2/M.$$

If the cumulative logit model is correct, then the proposed estimation approach is
efficient. If the model is not correct, then it may be the case that $\eta_g'$ and $\gamma_k'$ can be found
such that

$$E([\bar Y_i - T_{iL}']^2) < \sigma_{dL}^2$$

for

$$T_{iL}' = \sum_{g=1}^{G} g[P_{ig}' - P_{i(g-1)}'] = 1 + \sum_{g=1}^{G-1}(1 - P_{ig}'),$$

$$P_{ig}' = [1 + \exp(-\lambda_{ig}')]^{-1},$$

and

$$\lambda_{ig}' = \eta_g' + \sum_{k=1}^{q} \gamma_k' X_{ik}.$$

Given a desire to employ common software in a routine fashion, no attempt has been made
to exploit this possibility.

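The proportional reduction of mean-squared error achieved by prediction of $\bar Y_i$
by $T_{iL}^*$ instead of by $E(\bar Y_i)$ is

$$\rho_{\bar YL}^2 = \frac{\sigma_{\bar Y}^2 - \sigma_{dL}^2}{\sigma_{\bar Y}^2}. \qquad (9)$$

The proportional reduction of mean-squared error in predicting $T_i$ by $T_{iL}^*$ instead of by
$E(T_i)$ is

$$\rho_{TL}^2 = \frac{\sigma_T^2 - \sigma_{dTL}^2}{\sigma_T^2} = \rho_{\bar YL}^2\,\sigma_{\bar Y}^2/\sigma_T^2.$$

Because $\sigma_T^2$ is less than $\sigma_{\bar Y}^2$, $\rho_{TL}^2$ exceeds $\rho_{\bar YL}^2$. Given that $m_i = M$, the proportional
reduction in mean-squared error in predicting $\bar Y_i$ by $T_{iL}^*$ instead of by $E(\bar Y_i)$ is

$$\rho_{\bar YML}^2 = \rho_{\bar YL}^2\,\sigma_{\bar Y}^2/\sigma_{\bar YM}^2.$$

The relationship of $\rho_{\bar YL}^2$ to $\rho_{\bar YML}^2$ depends on whether $M$ exceeds $m_H$.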

4.2 Inflation of Mean-Squared Error

In the case of cumulative logit analysis, cross-validation entails evaluation of the
conditional mean-squared error $\tau_{r0L}^2 = E(r_{0L}^2 \mid X_i, 1 \le i \le n)$ for prediction of $\bar Y_0$ by $T_{0L}$ given
the predictors $X_i$, $1 \le i \le n$. Let $f_{iL} = T_{iL} - T_{iL}^*$ be the difference between the estimated
predictor $T_{iL}$ and the actual predictor $T_{iL}^*$, and let $\tau_{f0L}^2$ be the conditional expected value
$E(f_{0L}^2 \mid X_i, 1 \le i \le n)$ of the squared deviation $f_{0L}^2$ given the predictors $X_i$, $1 \le i \le n$. Then

$$\tau_{r0L}^2 = \sigma_{dL}^2 + \tau_{f0L}^2 \ge \sigma_{dL}^2.$$

As the sample size $n$ becomes large, standard large-sample arguments similar to those for
log-linear models (Gilula & Haberman, 1994) show that $n\tau_{f0L}^2$ converges with probability 1
to a constant $\mu_L$. Of interest is the relative increase

$$I_L = \tau_{r0L}^2/\sigma_{dL}^2 - 1$$

in conditional mean-squared error due to parameter estimation. This relative increase is of
order $1/n$.

As in the case of linear regression, computations of relative increases in mean-squared
error must be modified to some extent to study increase in error of prediction for the
true essay score $T_i$. The conditional mean-squared error $\tau_{rT0L}^2 = E(r_{T0L}^2 \mid X_i, 1 \le i \le n)$ is
compared to $E([T_i - T_{iL}^*]^2) = \sigma_{dTL}^2$. Because

$$\tau_{rT0L}^2 = \sigma_{dTL}^2 + \tau_{f0L}^2,$$

the relative increase

$$I_{TL} = \tau_{rT0L}^2/\sigma_{dTL}^2 - 1$$

is $I_L\,\sigma_{dL}^2/\sigma_{dTL}^2 > I_L$.

Similar arguments can be applied to the relative increase $I_{ML}$ in mean-squared error
for approximation of $\bar Y_0$ conditional on $m_0 = M > 0$. Given $m_0 = M$, the conditional
mean-squared error

$$\tau_{rM0L}^2 = E([\bar Y_0 - T_{0L}]^2 \mid X_i, 1 \le i \le n, m_0 = M) = \tau_{rT0L}^2 + \sigma_e^2/M$$

for predicting $\bar Y_0$ by $T_{0L}$ is compared to the conditional mean-squared error $\sigma_{dML}^2$ for
predicting $\bar Y_0$ by $T_{0L}^*$. The relative increase in mean-squared error $I_{ML}$ is defined as

$$I_{ML} = \tau_{rM0L}^2/\sigma_{dML}^2 - 1.$$


4.3 Estimation of Mean-Squared Error

Estimation of mean-squared errors may be accomplished by use of deleted residuals;
however, such a step is rather tedious for cumulative logit models if conventional software
is used for analysis. An alternative approach may be based on a random division of the
sample indices 1 to $n$ into nearly equal groups (Haberman, 2006). Let $U \ge 2$ be the number
of groups employed, and let each group have $[n/U]$ or $[n/U] + 1$ members, where $[n/U]$
is the largest integer that does not exceed $n/U$. The accuracy of results is best for larger
values of $U$, although computational convenience favors smaller $U$. Let $K_u$ denote the
collection of indices for group $u$, let $n_u$ be the number of members of $K_u$, and let $T_{iLu}$ be
the estimate of $T_i$ provided by applying the cumulative logit model to observations with
indices $i$ such that $1 \le i \le n$ and $i$ is not in $K_u$. Let $r_{iLu} = \bar Y_i - T_{iLu}$ be the corresponding
residual. Let

$$s_{rL}^2 = n^{-1}\sum_{i=1}^{n} r_{iL}^2,$$

let

$$s_{rL1}^2 = (Un)^{-1}\sum_{u=1}^{U}\sum_{i=1}^{n} r_{iLu}^2,$$

and let

$$s_{rL2}^2 = n^{-1}\sum_{i=1}^{n} r_{iL*}^2,$$

where $r_{iL*} = r_{iLu}$ for all $i$ in $K_u$ for $1 \le u \le U$. With probability 1, the conditional
expectation of $n(s_{rL}^2 - \sigma_{dL}^2)$ given $X_i$, $1 \le i \le n$, converges to a constant $\nu_L$. Similarly, with
probability 1, the conditional expectation of $n(s_{rL2}^2 - \sigma_{dL}^2)$ given $X_i$, $1 \le i \le n$, converges
to $\mu_L U/(U - 1)$, and the conditional expectation of $n(s_{rL1}^2 - \sigma_{dL}^2)$ given $X_i$, $1 \le i \le n$,
converges to $\nu_L + \mu_L/(U - 1)$. The arguments required are very similar to those previously
used with multinomial response models (Gilula & Haberman, 1994). The conditional
expectation of $n(s_{dL}^2 - \sigma_{dL}^2)$ given $X_i$, $1 \le i \le n$, and the conditional expectation of
$n(s_{r0L}^2 - \tau_{r0L}^2)$ given $X_i$, $1 \le i \le n$, both converge to 0 with probability 1. It follows that
$\sigma_{dL}^2$ may be estimated by

$$s_{dL}^2 = U(s_{rL}^2 - s_{rL1}^2) + s_{rL2}^2$$

and $\tau_{r0L}^2$ may be estimated by

$$s_{r0L}^2 = s_{rL2}^2 - s_{rL1}^2 + s_{rL}^2.$$

In large samples, $n^{1/2}(s_{r0L}^2 - \tau_{r0L}^2)$ and $n^{1/2}(s_{dL}^2 - \sigma_{dL}^2)$ both have approximate normal
distributions with mean 0 and variance equal to the variance of $d_{iL}^2$. Again arguments
similar to those required with log-linear models can be applied (Gilula & Haberman, 1994).
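The grouped cross-validation scheme above can be sketched with a generic model-fitting callable. In the sketch below, `fit_predict` is a hypothetical interface (fit on training data, return predictions for new feature rows), and ordinary least squares is used only as a stand-in so that the example is self-contained; it is not the cumulative logit fit used in the report, which would be supplied through the same interface.

```python
import numpy as np

def ufold_mse(X, y, fit_predict, U=10, seed=0):
    """Grouped cross-validation estimates of the two mean-squared errors."""
    n = len(y)
    rng = np.random.default_rng(seed)
    groups = np.array_split(rng.permutation(n), U)   # group sizes differ by at most 1
    r_full = y - fit_predict(X, y, X)                # residuals from the full-sample fit
    s_rL = np.mean(r_full ** 2)
    sq_all = 0.0    # squared residuals over all observations and all U fits
    sq_held = 0.0   # squared residuals for held-out observations only
    for held in groups:
        train = np.setdiff1d(np.arange(n), held)
        r_u = y - fit_predict(X[train], y[train], X)
        sq_all += np.sum(r_u ** 2)
        sq_held += np.sum(r_u[held] ** 2)
    s_rL1 = sq_all / (U * n)
    s_rL2 = sq_held / n
    return {"s_rL": s_rL, "s_rL1": s_rL1, "s_rL2": s_rL2,
            "s_dL": U * (s_rL - s_rL1) + s_rL2,      # known-parameter MSE estimate
            "s_r0L": s_rL2 - s_rL1 + s_rL}           # cross-validation MSE estimate

def ols_fit_predict(X_train, y_train, X_new):
    """Stand-in model (least squares); replace with a cumulative logit fit."""
    Z = np.column_stack([np.ones(len(y_train)), X_train])
    beta, *_ = np.linalg.lstsq(Z, y_train, rcond=None)
    return beta[0] + X_new @ beta[1:]

rng = np.random.default_rng(1)
X = rng.normal(size=(373, 8))
y = 3.5 + X @ (0.3 * rng.normal(size=8)) + rng.normal(scale=0.6, size=373)
est = ufold_mse(X, y, ols_fit_predict, U=10)
print({k: round(float(v), 4) for k, v in est.items()})
```

Larger values of U require more model fits but leave each fit closer to the full-sample fit, which mirrors the trade-off between accuracy and computational convenience described above.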

4.4 Estimation of $\tau_{rT0L}^2$ and $\tau_{rM0L}^2$

Estimations of $\tau_{rT0L}^2$ and $\tau_{rM0L}^2$ are accomplished in a manner similar to that for $\tau_{rT0}^2$
and $\tau_{rM0}^2$. Assume that $m_i > 1$ with positive probability. Then $\tau_{rT0L}^2$ may be estimated by

$$s_{rT0L}^2 = s_{r0L}^2 - \hat\sigma_{\bar e}^2,$$

and $\tau_{rM0L}^2$ may be estimated by

$$s_{rM0L}^2 = s_{rT0L}^2 + \hat\sigma_e^2/M.$$

4.5 Estimation of $\sigma_{dTL}^2$, $\sigma_{dML}^2$, and Inflations of Mean-Squared Error

The estimate of the relative increase $I_L$ in mean-squared error is now

$$\hat I_L = (s_{r0L}^2 - s_{dL}^2)/s_{dL}^2.$$

It also follows that an estimate of the value of $I_L$ achieved if $n^*$ observations are used
rather than $n$ is $n\hat I_L/n^*$. As in the regression case, this estimate provides a guide to
sample-size selection.

If $m_i > 1$ with positive probability, then $\sigma_{dTL}^2$ may be estimated by

$$s_{dTL}^2 = s_{dL}^2 - \hat\sigma_{\bar e}^2,$$

and $I_{TL}$ may be estimated by

$$\hat I_{TL} = (s_{rT0L}^2 - s_{dTL}^2)/s_{dTL}^2.$$

Similar arguments apply to estimation of $I_{ML}$. The estimate of $\sigma_{dML}^2$ is $s_{dML}^2 = s_{dTL}^2 + \hat\sigma_e^2/M$,
so that $I_{ML}$ has the estimate

$$\hat I_{ML} = (s_{rM0L}^2 - s_{dML}^2)/s_{dML}^2.$$

4.6 Estimation of Proportional Reduction in Mean-Squared Error Using
Cross-Validation

As in the regression case, proportional reduction in mean-squared error may also be
considered in terms of cross-validation. For estimation of $\bar Y_0$, consider the proportional
reduction

$$\rho_{\bar Y0L}^2 = \frac{\tau_{r0C}^2 - \tau_{r0L}^2}{\tau_{r0C}^2}$$

in mean-squared error. As in regression analysis, $\tau_{r0C}^2$ is the mean-squared error from
prediction of $\bar Y_0$ by $\bar Y$. In contrast, $\tau_{r0L}^2$ is the mean-squared error from prediction of $\bar Y_0$ by
$T_{0L}$. The logical estimate of $\rho_{\bar Y0L}^2$ is

$$\hat\rho_{\bar Y0L}^2 = \frac{s_{r0C}^2 - s_{r0L}^2}{s_{r0C}^2}.$$

As the sample size $n$ increases, $\rho_{\bar Y0L}^2$ converges with probability 1 to $\rho_{\bar YL}^2$ and $\hat\rho_{\bar Y0L}^2$
converges to $\rho_{\bar YL}^2$. In like manner, consider the proportional reduction

$$\rho_{T0L}^2 = \frac{\tau_{rT0C}^2 - \tau_{rT0L}^2}{\tau_{rT0C}^2}$$

in mean-squared error. Here $\tau_{rT0C}^2$, as in linear regression, is the mean-squared error from
prediction of $T_0$ by $\bar Y$. In contrast, $\tau_{rT0L}^2$ is the mean-squared error from prediction of $T_0$ by
$T_{0L}$. The corresponding estimated proportional reduction in mean-squared error is

$$\hat\rho_{T0L}^2 = \hat\rho_{\bar Y0L}^2\,\frac{s_{r0C}^2}{s_{rT0C}^2},$$

and

$$\rho_{M0L}^2 = \frac{\tau_{rM0C}^2 - \tau_{rM0L}^2}{\tau_{rM0C}^2}$$

is estimated by

$$\hat\rho_{M0L}^2 = \hat\rho_{\bar Y0L}^2\,\frac{s_{r0C}^2}{s_{rM0C}^2}.$$
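The rescaling form of the estimate for true scores can be checked in a few lines. The derivation below is a sketch that assumes, as in the regression guide and in the estimation formulas of this section, that the constant-predictor and cumulative logit estimates for true scores are each obtained by subtracting the same estimated variance of the average rater error:

```latex
\begin{align*}
\hat\rho_{T0L}^2
  &= \frac{s_{rT0C}^2 - s_{rT0L}^2}{s_{rT0C}^2}
   = \frac{(s_{r0C}^2 - \hat\sigma_{\bar e}^2) - (s_{r0L}^2 - \hat\sigma_{\bar e}^2)}{s_{rT0C}^2} \\
  &= \frac{s_{r0C}^2 - s_{r0L}^2}{s_{r0C}^2}\cdot\frac{s_{r0C}^2}{s_{rT0C}^2}
   = \hat\rho_{\bar Y0L}^2\,\frac{s_{r0C}^2}{s_{rT0C}^2}.
\end{align*}
```

The same cancellation applies with $M$ raters, since adding $\hat\sigma_e^2/M$ to both the constant-predictor and the cumulative logit estimates leaves the numerator unchanged and only alters the denominator.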


4.7 A Practitioner’s Guide to Cumulative Logits

Application of formulas for cumulative logit analysis is somewhat similar to that
for linear regression analysis, although a few changes occur when standard software is
employed. Consider once again the first prompt in the second group of essays. We will
discuss results for the same three hypothetical users we considered in the practitioner’s
guide for the linear regression model.

First consider User 1, who is interested in the evaluation of performance in prediction
of the average of the two rater scores. In this case, SAS provides the estimates $\hat P_{ig} - \hat P_{i(g-1)}$
for $g$ from 1 to $G = 6$. Use of standard SAS functions permits computation of the estimated
means

$$T_{iL} = \sum_{g=1}^{G} g[\hat P_{ig} - \hat P_{i(g-1)}] = 1 + \sum_{g=1}^{G-1}(1 - \hat P_{ig}),$$

residuals

$$r_{iL} = \bar Y_i - T_{iL},$$

and squared residuals $r_{iL}^2$. The average squared residual $s_{rL}^2$ is found to be 0.3139, a value
a bit smaller than the corresponding regression estimate of $s_r^2 = 0.3198$. To implement
cross-validation, $U = 10$ is selected. Using a series of SAS macros and computations
of variables leads to an average squared residual for observations not used in model-fitting of
$s_{rL2}^2 = 0.3330$. The average squared residual among all observations for all model fits with deleted
data is $s_{rL1}^2 = 0.3148$. The estimated value of the conditional mean-squared error $\tau_{r0L}^2$ for
predicting a new average holistic score $\bar Y_0$ by the estimated cumulative logit predictor $T_{0L}$
is then

$$s_{r0L}^2 = s_{rL2}^2 - s_{rL1}^2 + s_{rL}^2 = 0.3330 - 0.3148 + 0.3139 = 0.3321,$$

a modest improvement over the corresponding regression value of 0.3364. Similarly, one
may estimate the mean-squared error $\sigma_{dL}^2$ achieved by prediction of $\bar Y_i$ by the
predictor $T_{iL}^*$ obtained through knowledge of the joint distribution of $\bar Y_i$ and the features
$X_{ik}$, $1 \le k \le q$. The estimate

$$s_{dL}^2 = U(s_{rL}^2 - s_{rL1}^2) + s_{rL2}^2 = 10(0.3139 - 0.3148) + 0.3330 = 0.3238$$

is a bit smaller than is $s_{r0L}^2$. The estimate $s_{dL}^2$ is also smaller than is the corresponding
estimate $s_d^2 = 0.3281$ from regression analysis. The estimated proportional reduction in
mean-squared error

$$\hat\rho_{\bar Y0L}^2 = \frac{s_{r0C}^2 - s_{r0L}^2}{s_{r0C}^2} = \frac{0.9522 - 0.3321}{0.9522} = 0.6513$$

is slightly larger than the corresponding value $\hat\rho_{\bar Y0}^2 = 0.6467$ from regression analysis. The
estimated inflation of mean-squared error is

$$\hat I_L = \frac{s_{r0L}^2 - s_{dL}^2}{s_{dL}^2} = \frac{0.3321 - 0.3238}{0.3238} = 0.02548,$$

slightly larger than is the corresponding value 0.02518 for regression analysis.

Next, consider User 2, who is interested in investigating prediction errors that are not
due to rater variability. Here $\sigma_{dTL}^2$ measures the error of prediction of the true essay score by
the essay features when the joint distribution of essay features and holistic scores is known.
The corresponding measure is $\tau_{rT0L}^2$ when cumulative logit parameters are estimated from
the sample and the essay under study is not in the sample. As in the regression case, the
estimated rater variance $\hat\sigma_e^2 = 0.1260$ is used, so that $\hat\sigma_{\bar e}^2$, the estimated variance of the
average rater error $\bar e_i$ for essay $i$, is 0.1260/2 = 0.0630. It follows that the mean-squared
error $\sigma_{dTL}^2$ has estimate

$$s_{dTL}^2 = s_{dL}^2 - \hat\sigma_{\bar e}^2 = 0.3238 - 0.0630 = 0.2608.$$

As in regression analysis, a substantial fraction (0.0630/0.3238 = 0.195) of the estimated
mean-squared error $s_{dL}^2$ for predicting the average holistic score $\bar Y_i$ is due to rater variability.
In like fashion, $\tau_{rT0L}^2$ has estimate

$$s_{rT0L}^2 = s_{r0L}^2 - \hat\sigma_{\bar e}^2 = 0.3321 - 0.0630 = 0.2691.$$

The difference between $s_{rT0L}^2$ and the corresponding value $s_{rT0}^2 = 0.2734$ in regression
analysis reflects the difference between $s_{r0L}^2$ and $s_{r0}^2$. As in regression analysis, the inflation
$I_{TL}$ in mean-squared error associated with true essay scores has estimate

$$\hat I_{TL} = \frac{0.2691 - 0.2608}{0.2608} = 0.03164$$

that is larger than $\hat I_L$, the corresponding estimated inflation in mean-squared error for
predicting observed average essay scores. The values of $\hat I_{TL}$ for cumulative logit analysis
and $\hat I_T = 0.03117$ for regression analysis are quite similar.

As in regression analysis, the proportional reduction $\rho_{TL}^2$ in mean-squared error for
predicting true essay score by essay features when all joint distributions are known is larger
than the corresponding proportional reduction $\rho_{\bar YL}^2$ in mean-squared error for predicting
average essay score by essay features. When sampling is required, the proportional
reduction $\rho_{T0L}^2$ in mean-squared error for predicting true essay score by essay features is
normally smaller than $\rho_{TL}^2$. The estimated proportional reduction in mean-squared error is

$$\hat\rho_{T0L}^2 = (0.8892 - 0.2691)/0.8892 = 0.6974,$$

a value slightly higher than the corresponding estimate $\hat\rho_{T0}^2 = 0.6926$ from regression
analysis.

Finally, consider User 3, who is interested in model performance for
predicting the score of a single rater ($M = 1$) using the data based on two raters per
essay. Coefficients $\sigma_{dML}^2$, $\tau_{rM0L}^2$, $I_{ML}$, and $\rho_{\bar YML}^2$ are considered here. For the example, the
estimated mean-squared error $s_{dML}^2$ for known population characteristics is

$$s_{dML}^2 = s_{dTL}^2 + \hat\sigma_e^2 = 0.2608 + 0.1260 = 0.3868,$$

while the estimated cross-validation mean-squared error is

$$s_{rM0L}^2 = s_{rT0L}^2 + \hat\sigma_e^2 = 0.2691 + 0.1260 = 0.3951.$$

The estimated proportional reduction in mean-squared error is

$$\hat\rho_{M0L}^2 = (1.015 - 0.3951)/1.015 = 0.6108,$$

a value slightly higher than the corresponding regression estimate $\hat\rho_{M0}^2 = 0.6066$. Inflation of
mean-squared error is similar to that found in the regression case. For cumulative logits,

$$\hat I_{ML} = (0.3951 - 0.3868)/0.3868 = 0.02133.$$

The comparable value for regression analysis is $\hat I_M = 0.02113$.

4-8 Results From Analysis of the Data 

Data analysis with cumulative logits was performed in a manner similar to that for 
regression. We used U = 10 in the cross-validation calculations. First, each essay prompt in 
the four groups of prompts was analyzed. A summary of results is reported in Table 8. 
Several basic conclusions appear possible for these groups of prompts. Cumulative logit 
analysis results in a notable reduction in mean-squared error compared to linear regression. 
Inflation in mean-squared error due to estimation of parameters is quite comparable to 
that encountered with regression. Thus sample-size recommendations are similar to those 
for regression analysis. The general pattern of relative success of prediction for different 
groups, at least as measured by proportional reduction in mean-squared error, is the same 
as for regression analysis. 

Table 8. 

Summary of Cumulative Logit Analysis of Individual Essay Prompts Within Groups 

                                Group 1         Group 2         Group 3         Group 4     
Statistic                       Mean    S.D.    Mean    S.D.    Mean    S.D.    Mean    S.D.
$n$                            495.9     6.5   456.0    62.0   467.0    14.4   499.8     6.2
$s^2_{r0L}$                    0.335   0.061   0.350   0.118   0.230   0.039   0.443   0.184
$s^2_{r0C}$                    0.871   0.084   1.234   0.255   1.688   0.322   1.215   0.297
$\hat{I}_L$                   0.0198  0.0034  0.0227  0.0055  0.0233  0.0036  0.0195  0.0040
$n\hat{I}_L$                    9.60    1.67   10.13    1.83   10.86    1.61    9.74    1.97
$\hat{\rho}^2_{\bar{Y}0L}$     0.613   0.079   0.699   0.137   0.859   0.033   0.643   0.096
$s^2_{rT0L}$                                   0.292   0.125   0.115   0.036   0.307   0.187
$s^2_{rT0C}$                                   1.175   0.261   1.572   0.323   1.080   0.301
$\hat{I}_{TL}$                                0.0290  0.0091  0.0504  0.0120  0.0335  0.0123
$n\hat{I}_{TL}$                                12.90    3.16    23.6     5.9    16.8     6.1
$\hat{\rho}^2_{T0L}$                           0.736   0.144   0.926   0.019   0.732   0.120
$s^2_{r10L}$                                   0.408   0.116   0.345   0.046   0.570   0.178
$s^2_{r10C}$                                   1.292   0.251   1.803   0.322   1.351   0.293
$\hat{I}_{1L}$                                0.0191  0.0050  0.0153  0.0026  0.0144  0.0034
$n\hat{I}_{1L}$                                 8.55    1.78    7.15    1.15    7.17    1.64
$\hat{\rho}^2_{10L}$                           0.667   0.134   0.802   0.046   0.574   0.085
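
One way to read the rows of Table 8 that give the product of sample size and relative inflation is as a rough planning constant. If relative inflation is assumed to shrink in proportion to 1/n (an assumption of this sketch, suggested by the way the product is tabulated rather than stated as a rule in this section), then dividing the product by a target inflation gives an approximate sample size. The function name and the 5% target below are illustrative.

```python
def essays_needed(n_times_inflation, target_inflation):
    """Approximate sample size at which relative inflation falls to the target.

    Assumes inflation behaves roughly like c / n, with c estimated by the
    tabulated product of sample size and relative inflation.
    """
    if target_inflation <= 0:
        raise ValueError("target_inflation must be positive")
    return n_times_inflation / target_inflation


# Mean products read from Table 8 (average score, plus one true-score value).
products = {
    "Group 1, average score": 9.60,
    "Group 2, average score": 10.13,
    "Group 3, average score": 10.86,
    "Group 4, average score": 9.74,
    "Group 3, true score": 23.6,
}

for label, product in products.items():
    n_required = essays_needed(product, 0.05)
    print(f"{label}: about {n_required:.0f} essays for 5% inflation")
```

Under that assumption, products near 10 correspond to roughly 200 essays per prompt for 5% inflation, and the largest true-score product (23.6) to a little under 500.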

4.9 Combining Essays in Groups 

As in the regression case, an alternative approach summarizes the data by looking 
at the prediction of a score for a group of essays. The initial approach is to use distinct 
parameters for each prompt, so that K prompts involve K(q + G) parameters. As evident 
from Table 9, results for this approach are quite similar to those shown in Table 8, except 
that the proportional reduction in mean-squared error is increased slightly because it is 
computed relative to a constant predictor for all essays in the entire family rather than 
relative to a constant predictor for each prompt. The number of prompts scored in Group 
1 is sufficient for analysis related to true scores; however, results for true scores and for 
exactly one score should be approached with caution given that the sampling assumptions 
appear questionable. The increase in the estimated product of sample size by relative 
inflation of mean-squared error primarily reflects the increased number of predictors present 
in the analysis. 

Table 9. 

Summary of Cumulative Logit Analysis of Essay Prompts Within Groups: 
Distinct Coefficients for Each Essay 

Statistic                      Group 1   Group 2   Group 3   Group 4
$n$                             36,697     6,384    12,143     7,997
$s^2_{r0L}$                      0.352     0.366     0.288     0.443
$s^2_{r0C}$                      0.892     1.320     1.698     1.355
$\hat{I}_L$                     0.0194    0.0225    0.0232    0.0196
$n\hat{I}_L$                    711.15    143.57    281.71    156.78
$\hat{\rho}^2_{\bar{Y}0L}$       0.605     0.723     0.856     0.636
$s^2_{rT0L}$                     0.109     0.292     0.115     0.307
$s^2_{rT0C}$                     0.668     1.262     1.582     1.219
$\hat{I}_{TL}$                  0.0620    0.0271    0.0476    0.0285
$n\hat{I}_{TL}$               2,273.86    172.88    578.40    228.12
$\hat{\rho}^2_{T0L}$             0.837     0.769     0.928     0.748
$s^2_{r10L}$                     0.347     0.408     0.345     0.578
$s^2_{r10C}$                     0.906     1.378     1.813     1.491
$\hat{I}_{1L}$                  0.0186    0.0192    0.0153    0.0149
$n\hat{I}_{1L}$                 683.28    122.78    186.20    119.43
$\hat{\rho}^2_{10L}$             0.617     0.704     0.810     0.612
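
The link between Table 8 and Table 9 can be checked numerically. The sketch below infers the number of prompts in each group as the total sample size in Table 9 divided by the mean per-prompt sample size in Table 8 (an inference made here, not a count given in this section), then divides the pooled product of sample size and inflation by that number. All names are illustrative.

```python
# For each group: (total n from Table 9, mean per-prompt n from Table 8,
#                  pooled n*I from Table 9, mean per-prompt n*I from Table 8)
groups = {
    "Group 1": (36697, 495.9, 711.15, 9.60),
    "Group 2": (6384, 456.0, 143.57, 10.13),
    "Group 3": (12143, 467.0, 281.71, 10.86),
    "Group 4": (7997, 499.8, 156.78, 9.74),
}


def implied_prompt_count(total_n, mean_n_per_prompt):
    """Number of prompts implied by the two sample-size rows (rounded)."""
    return round(total_n / mean_n_per_prompt)


for name, (total_n, mean_n, pooled_product, per_prompt_product) in groups.items():
    k = implied_prompt_count(total_n, mean_n)
    print(
        f"{name}: K ~ {k}, pooled n*I / K = {pooled_product / k:.2f}, "
        f"Table 8 mean n*I = {per_prompt_product:.2f}"
    )
```

The per-prompt quotients (about 9.6, 10.3, 10.8, and 9.8) sit close to the Table 8 means, which is consistent with the statement that the larger pooled product mainly reflects the larger number of parameters.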

As with a previous regression analysis, one may consider a model in which, for all 
prompts in a group, the slope for a feature is constant but the intercept is the sum of a 
score effect and a prompt effect, so that q + G + K - 1 parameters are used. Results are 
summarized in Table 10. Relative to using distinct intercepts and regression slopes for each 
prompt, estimated losses in mean-squared error are very limited (Groups 1, 2, and 4) or 
nonexistent (Group 3). This approach has much more modest sample-size requirements 
than does the approach with individual regression coefficients for each prompt and produces 
a tolerable inflation of mean-squared error with about a tenth of the essays for each prompt. 
Note that this approach does require that the group contains a substantial number of 
prompts. 

Table 10. 

Summary of Cumulative Logit Analysis of Essay Prompts Within Groups: 
Distinct Intercepts for Each Essay, Common Slopes 

Statistic                      Group 1   Group 2   Group 3   Group 4
$n$                             36,697     6,384    12,143     7,997
$s^2_{r0L}$                      0.342     0.361     0.231     0.454
$s^2_{r0C}$                      0.892     1.320     1.698     1.355
$\hat{I}_L$                     0.0023    0.0034    0.0036    0.0027
$n\hat{I}_L$                     83.62     21.49     42.98     21.64
$\hat{\rho}^2_{\bar{Y}0L}$       0.617     0.726     0.864     0.665
$s^2_{rT0L}$                     0.117     0.303     0.115     0.318
$s^2_{rT0C}$                     0.668     1.262     1.582     1.219
$\hat{I}_{TL}$                  0.0067    0.0040    0.0071    0.0039
$n\hat{I}_{TL}$                 245.53     25.62     86.27     30.91
$\hat{\rho}^2_{T0L}$             0.825     0.760     0.927     0.739
$s^2_{r10L}$                     0.355    0.4194     0.346     0.590
$s^2_{r10C}$                     0.906     1.378     1.813     1.491
$\hat{I}_{1L}$                  0.0022    0.0029    0.0024    0.0021
$n\hat{I}_{1L}$                  80.47     18.51     28.62     16.65
$\hat{\rho}^2_{10L}$             0.608     0.696     0.809     0.604
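
The parameter counts quoted for the two grouped models can be compared directly. In this sketch q, G, and K are used as in the formulas K(q + G) and q + G + K - 1; the numerical values are hypothetical and chosen only for illustration, and the function names are not from the report.

```python
def params_distinct_coefficients(q, G, K):
    """Distinct parameters for every prompt: K(q + G)."""
    return K * (q + G)


def params_common_slopes(q, G, K):
    """Common slopes, intercept = score effect + prompt effect: q + G + K - 1."""
    return q + G + K - 1


# Hypothetical values, not taken from the report.
q, G = 8, 5

for K in (1, 10, 50):
    distinct = params_distinct_coefficients(q, G, K)
    common = params_common_slopes(q, G, K)
    print(f"K = {K:2d}: distinct = {distinct:4d}, common slopes = {common:3d}")
```

With one prompt the two counts coincide. As K grows, the first count grows in proportion to K while the second grows by one parameter per added prompt, which is in line with the much smaller inflation reported for the common-slope model.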

The simplest model for a group of essays ignores the prompt entirely, so that only 
G + q parameters are needed. Results are summarized in Table 11. As in the regression 
case, although the inflations of mean-squared error are very small, the tradeoff is a 
substantial increase in the actual mean-squared error in Group 1 and in Group 2. Losses 
in mean-squared error are also encountered in the other groups, but they are very small in 
Group 3 and modest in Group 4. As in the regression case, the virtue of the approach with 
a common equation for each prompt is that the sample-size requirements for the group are 
similar to those for a single prompt. Thus one could consider use of several hundred essays 
for a complete group. 

Table 11. 

Summary of Cumulative Logit Analysis of Essay Prompts Within Groups: 
Common Intercepts and Common Regression Slopes 

Statistic                      Group 1   Group 2   Group 3   Group 4
$n$                             36,697     6,384    12,143     7,997
$s^2_{r0L}$                      0.384     0.425     0.235     0.482
$s^2_{r0C}$                      0.892     1.320     1.698     1.355
$\hat{I}_L$                     0.0002    0.0011    0.0009    0.0010
$n\hat{I}_L$                      9.13      6.89     10.94      7.96
$\hat{\rho}^2_{\bar{Y}0L}$       0.569     0.678     0.862     0.644
$s^2_{rT0L}$                     0.160     0.367     0.120     0.347
$s^2_{rT0C}$                     0.668     1.262     1.582     1.219
$\hat{I}_{TL}$                  0.0006    0.0013    0.0018    0.0014
$n\hat{I}_{TL}$                  21.96      7.99     21.50     11.08
$\hat{\rho}^2_{T0L}$             0.760     0.709     0.924     0.716
$s^2_{r10L}$                     0.398     0.483     0.350     0.618
$s^2_{r10C}$                     0.906     1.378     1.813     1.491
$\hat{I}_{1L}$                  0.0002    0.0010    0.0006    0.0008
$n\hat{I}_{1L}$                   8.82      6.07      7.34      6.21
$\hat{\rho}^2_{10L}$             0.569     0.650     0.807     0.585
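
The size of the tradeoff can be read off by comparing the first mean-squared-error row (cross-validation error for the average score) in Table 10, where intercepts are prompt-specific, with the same row in Table 11, where prompt effects are ignored. The sketch below computes the relative increase; the dictionary layout and names are illustrative.

```python
# Cross-validation mean-squared error for the average score, by group.
mse_table10 = {"Group 1": 0.342, "Group 2": 0.361, "Group 3": 0.231, "Group 4": 0.454}
mse_table11 = {"Group 1": 0.384, "Group 2": 0.425, "Group 3": 0.235, "Group 4": 0.482}


def relative_increase(before, after):
    """Relative change in mean-squared error when prompt effects are dropped."""
    return (after - before) / before


increases = {
    group: relative_increase(mse_table10[group], mse_table11[group])
    for group in mse_table10
}

for group, value in increases.items():
    print(f"{group}: {100 * value:.1f}% increase")
```

The increases come out near 12% and 18% for Groups 1 and 2, under 2% for Group 3, and about 6% for Group 4, matching the qualitative description of the tradeoff.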

Each cumulative logit model performs better than its corresponding regression model, 
making the cumulative logit approach attractive. The gains for the cumulative logit method 
are somewhat variable. 


5. Conclusions 

This paper employs cross-validation methods to assess sample-size requirements for 
both cumulative logit and ordinary regression models. Sample-size requirements depend on 
the application and on the e-rater features used. In typical cases in which content analysis 
is not employed and the only object is to score individual essays to provide feedback to 
the examinee, it appears that several hundred essays are quite sufficient to limit variance 
inflation to less than 5%. For a large family of essays, fewer than 100 essays per prompt 
may often be adequate. 

Sample-size requirements when content is assessed appear to be much larger (Haberman, 
2006), and proper analysis requires significant modifications to currently used software. 

These recommendations are not appropriate for all potential uses. If e-rater is used 
within an equated assessment or if substantial groups of students are to be compared by 
using e-rater, then sample-size requirements may be much higher. 

For the examples in this report, using common parameters for essay features for all 
prompts within a family appears to be an attractive option, but completely ignoring all 
effects related to prompts appears to be less attractive. Nonetheless, for the third group 
of prompts, ignoring all prompt effects was strikingly successful. Groups of prompts 
therefore appear to require treatment on a case-by-case basis. 

An important finding in this paper is that the cumulative logit model typically 
performed somewhat better than did ordinary regression analysis. Although cumulative 
logit analysis requires more difficult cross-validation than does ordinary regression analysis, 
the cross-validation is hardly burdensome using standard statistical software. For electronic 
essay scoring, cumulative logit analysis should be considered a very attractive alternative 
to regression analysis. Conceptually, a cumulative logit model makes more sense than an 
ordinary regression model in electronic essay scoring because the observed responses (i.e., 
the essay scores) are categorical, with usually four to six categories. For all the groups 
of essays examined, the average mean-squared error for the cumulative logit model is 
less, often substantially less, than that for the ordinary regression model. In addition, 
the requirements in terms of sample size for the cumulative logit model appear to be 
comparable to those for ordinary regression. Consequently, our research indicates that it 
may be worthwhile to replace the ordinary regression model with a cumulative logit model 
in electronic essay-scoring software packages. 
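
To make the contrast with linear regression concrete, the following sketch shows how a cumulative logit model can turn a linear combination of features into probabilities for each score category and then into a predicted score. The parameterization (the probability of a score at or below a category as a logistic function of a cut point minus the linear predictor), the cut points, the slopes, and the use of the expected score as the prediction are all assumptions made for illustration; they are not taken from e-rater or from the fitted models in this report.

```python
import math


def logistic(z):
    """Standard logistic function."""
    return 1.0 / (1.0 + math.exp(-z))


def category_probabilities(cut_points, slopes, features):
    """Return P(score = 1), ..., P(score = len(cut_points) + 1)."""
    if list(cut_points) != sorted(cut_points):
        raise ValueError("cut_points must be increasing")
    eta = sum(b * x for b, x in zip(slopes, features))
    # Cumulative probabilities P(score <= g); the top category closes at 1.
    cumulative = [logistic(theta - eta) for theta in cut_points] + [1.0]
    probabilities = []
    previous = 0.0
    for value in cumulative:
        probabilities.append(value - previous)
        previous = value
    return probabilities


def expected_score(probabilities):
    """Expected score when categories are labeled 1, 2, 3, ..."""
    return sum(score * p for score, p in enumerate(probabilities, start=1))


# Hypothetical six-category example with two standardized features.
cut_points = [-3.0, -1.5, 0.0, 1.5, 3.0]
slopes = [1.2, 0.8]

for features in ([-1.0, -0.5], [0.0, 0.0], [1.0, 1.5]):
    probs = category_probabilities(cut_points, slopes, features)
    print(features, [round(p, 3) for p in probs], round(expected_score(probs), 2))
```

Unlike a linear prediction, the expected score computed this way always stays between the lowest and highest score category.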
Appendix 

Proof That an Estimate of $\sigma^2_d$ Is $s^2_d = (s^2_r + s^2_{r0})/2$ 

Equations 5 and 6 suggest that for large n, 

$$n(\tau^2_0 - \sigma^2_d) \approx \sigma^2_d + Z, \qquad \text{(A1)}$$

where $Z = \mathrm{tr}([\mathrm{Cov}(X)]^{-1}\,\mathrm{Cov}(dX))$. Equation A1 suggests that 

$$\sigma^2_d \approx (\tau^2_0 - Z/n)(1 + 1/n)^{-1} \approx (\tau^2_0 - Z/n)(1 - 1/n) \approx \tau^2_0 - \tau^2_0/n - Z/n.$$

By Equation 7, 

$$\sigma^2_d \approx \tau^2_0 - \tau^2_0/n - \tfrac{1}{2}\,[\tau^2_0 - s^2_r(1 + 2/n)].$$

Replacing $\tau^2_0$ by its estimate $s^2_{r0}$, 

$$\sigma^2_d \approx \frac{s^2_r + s^2_{r0}}{2}.$$